Anscombe's Quartet: why you should visualize your data

July 21st, 2018

If you looked at the graphs in the image to the left, you might think that the underlying data points wouldn't share much in common. However, all three data groups share the same mean value. That's right, the average value for these three visually disparate data sets is exactly the same. Not only that, but they also have the same Pearson correlation coefficient.

Pretty neat, right? Well, Frank Anscombe, an English statistician and the creator of Anscombe's Quartet, wanted to take his example even further: these three graphs also have the exact same standard deviation. That's right, the three visually different graphs on the left have the same correlation coefficient, mean, and standard deviation!

In this post I'm going to provide some code examples that verify these figures and plot the data as well. Astute readers may notice that there are, in fact, only three graphs. Anscombe's Quartet has four data sets, three of which share the same X values. For brevity and simplicity I have decided to work only with the three data sets that share the same X values. Hopefully, by the end of this post, readers will have a full appreciation for how much can be intuited from a graph, even when important statistical measures are exactly the same.

Pearson's Correlation Coefficient

Below, X represents an individual item in a list/array, X Bar is the mean of that group, Y is the corresponding item in your second group, and Y Bar is the mean of the second group.
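The formula image may not have survived here, so this is a reconstruction of the standard definitional form of Pearson's r; the Python function below computes the algebraically equivalent raw-score form.

```latex
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}
         {\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}
```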

Let's convert this to a function in Python:


import numpy as np

def pearson_coefficient(x, y):
    n = len(x)
    x_sum = sum(x)
    y_sum = sum(y)

    # Sums of squares and the sum of products
    x_squared_sum = sum(xi * xi for xi in x)
    y_squared_sum = sum(yi * yi for yi in y)
    xy_sum = sum(xi * yi for xi, yi in zip(x, y))

    # Raw-score form of Pearson's correlation coefficient
    numerator = (n * xy_sum) - (x_sum * y_sum)
    denominator = np.sqrt((n * x_squared_sum - x_sum**2) *
                          (n * y_squared_sum - y_sum**2))
    return numerator / denominator
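As a quick sanity check (my addition, not part of the original post), the value for the first data set can be compared against NumPy's built-in `np.corrcoef`:

```python
import numpy as np

# Anscombe's shared x values and the first y series
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between x and y1
r = np.corrcoef(x, y1)[0, 1]
print(r)  # approximately 0.8164
```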

Next up, let's calculate the standard deviation.

Below, X represents an individual item in the list/array, X Bar is the mean, and n is the number of items in the list.

The sample standard deviation formula: s = √( Σ(X − X̄)² / (n − 1) )

def calculate_standard_deviation(data):
    # "data" avoids shadowing Python's built-in list type
    n = len(data)
    average = sum(data) / n

    # Sum of squared deviations from the mean
    squared_deviations = 0
    for X in data:
        squared_deviations += (X - average)**2

    # Divide by n - 1 for the sample standard deviation
    return np.sqrt(squared_deviations / (n - 1))
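Again as a sanity check (my addition, not in the original post), the standard library's `statistics.stdev` computes the same sample standard deviation:

```python
import statistics

y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# statistics.stdev also divides by n - 1, matching the function above
s = statistics.stdev(y1)
print(s)  # approximately 2.0316
```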

The two functions above give us everything we need to set up the data and run our code:


import matplotlib.pyplot as plt
plt.style.use('dark_background')

x = [10,8,13,9,11,14,6,4,12,7,5]
y1 = [8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68]
y2 = [9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74]
y3 = [7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73]

print(f'coefficient for y1 is {pearson_coefficient(x,y1)}')
print(f'coefficient for y2 is {pearson_coefficient(x,y2)}')
print(f'coefficient for y3 is {pearson_coefficient(x,y3)}')

print(f'standard deviation of y1 is {calculate_standard_deviation(y1)}')
print(f'standard deviation of y2 is {calculate_standard_deviation(y2)}')
print(f'standard deviation of y3 is {calculate_standard_deviation(y3)}')

print(f'mean of y1 is {sum(y1)/len(y1)}')
print(f'mean of y2 is {sum(y2)/len(y2)}')
print(f'mean of y3 is {sum(y3)/len(y3)}')

plt.subplot(221)
plt.scatter(x,y1)
plt.title('y1')

plt.subplot(222)
plt.scatter(x,y2)
plt.title('y2')

plt.subplot(223)
plt.scatter(x,y3)
plt.title('y3')

plt.show()

Below is the output from the console:


coefficient for y1 is 0.816420516344843
coefficient for y2 is 0.8162365060002422
coefficient for y3 is 0.8162867394895953
standard deviation of y1 is 2.031568135925815
standard deviation of y2 is 2.0316567355016177
standard deviation of y3 is 2.030423601123667
mean of y1 is 7.500909090909093
mean of y2 is 7.500909090909091
mean of y3 is 7.500000000000001

Process finished with exit code 0

Result

As you can see, the results agree to two and even three decimal places across y1, y2, and y3. Below is the output from matplotlib verifying that the data sets, despite their nearly identical statistics, are indeed different.
