Basic statistical concepts

Why statistics?

Kareem Carr via dddrew.com

Because they’re EVERYWHERE

We live in a world of randomness

Statistics help provide some structure to the randomness

and allow us to extract useful information

Data are inherently random

If you collect a sample from a population

Then collect a second sample from the same population

They are very unlikely to be the same observations
(or measurements)
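A minimal sketch of this idea in R, using a simulated population as a stand-in (the values, `population`, `sample_1`, and `sample_2` are invented for illustration and are not the sheep data used later):

```r
# Simulated stand-in population (hypothetical values, for illustration only)
set.seed(1)
population <- rnorm(10000, mean = 30, sd = 1.5)

# Draw two independent samples of 20 observations from the same population
sample_1 <- sample(population, size = 20)
sample_2 <- sample(population, size = 20)

# Do the two samples contain exactly the same observations?
identical(sample_1, sample_2)

# Compare the sample means with the population mean
mean(sample_1)
mean(sample_2)
mean(population)
```

The two samples differ, yet both means should land reasonably near the population mean.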

However some observations will occur more
frequently than others

So we can assume that we will observe an outcome a certain number of times

if an experiment (or analysis) is repeated
(even if we don’t actually repeat it)

This allows us to

estimate parameters of a
population

without tediously studying the whole population

and the more experiments we conduct

or the more data we collect

the closer our estimates get to the true parameters
(as the number approaches infinity)
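This can be illustrated with a small simulation. The sketch below assumes a normally distributed population with a known mean (`true_mean` and the sample sizes are arbitrary choices for illustration):

```r
# Assumed "true" population mean (arbitrary value for illustration)
set.seed(42)
true_mean <- 30
sample_sizes <- c(10, 100, 1000, 10000, 100000)

# Estimate the mean from one simulated sample of each size
estimates <- sapply(sample_sizes, function(n) {
  mean(rnorm(n, mean = true_mean, sd = 1.5))
})

# Compare each estimate with the true value
data.frame(
  n = sample_sizes,
  estimate = estimates,
  error = abs(estimates - true_mean)
)
```

The error tends to shrink as n grows, although it will not necessarily shrink at every single step.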

This is known as the Frequentist approach

And is the more common approach in most of science
(for now…)

A statistic is a single value describing a collection of observations (data from a sample)

Since these observations are not predictable, we can treat them as values of a random variable

A statistic is often assigned the generic symbol \(\text{T}\)

A random variable is assigned \(\text{X}\)

Statistics describing random variables are themselves random variables (more later)
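One way to see this is to repeat the same "experiment" many times and compute the statistic each time. A rough sketch with simulated data (the population parameters and `sample_means` are assumptions for illustration):

```r
set.seed(123)

# Repeat the experiment 1000 times: draw 50 observations, compute the mean
sample_means <- replicate(1000, mean(rnorm(50, mean = 30, sd = 1.5)))

# The statistic takes a different value in each repetition
head(sample_means)

# So it has its own centre and spread, like any other random variable
mean(sample_means)
sd(sample_means)
```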

For example, measurements taken from a sheep’s astragalus

https://doi.org/10.1101/2022.12.24.521859

could be considered random variables

Let’s collect some measurements

sheep-data.csv

We can describe them in simpler terms
using descriptive statistics

We can calculate a central tendency of the data

The arithmetic mean (or ‘average’) of the GLl variable is: 30.34

It is calculated as

\[ \frac{1}{n} \sum_{i=1}^{n} GLl_i \]

Which is a fancy way of writing: all measurements of GLl added together (sum, or \(\Sigma\))

then divided by the number of measurements n, or sample size

Try it in R

sum(sheep_data$GLl) / length(sheep_data$GLl)

[1] 30.33866

mean(sheep_data$GLl)

[1] 30.33866

This is not very useful information in isolation


We need more context

A question we might ask is

how much do the data
vary around the mean?

We can calculate the difference between

the mean of our sample (\(\bar{x}\)) and each observation (\(x_i\))

and sum

\[ \sum_{i=1}^{n}(x_i - \bar{x}) \]

But that would always give us zero (up to rounding error)

because the negative differences below the mean exactly cancel the positive differences above it

(hence it’s a measure of central tendency…)
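The cancellation can be checked directly. A small sketch with simulated measurements (`x` is a hypothetical stand-in, not the sheep data):

```r
set.seed(7)
x <- rnorm(25, mean = 30, sd = 1.5)

# Difference between each observation and the sample mean
deviations <- x - mean(x)

# Negative and positive differences, summed separately
sum(deviations[deviations < 0])
sum(deviations[deviations > 0])

# Together they cancel out (zero, up to floating point error)
sum(deviations)
isTRUE(all.equal(sum(deviations), 0))
```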

so we square each difference to remove negative values

\[ \sum_{i=1}^{n}(x_i - \bar{x})^2 \]

which gives us the
sum of squared differences

But

larger samples

will have

larger sums of squared differences

making it difficult to compare different-sized samples
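A quick sketch of the problem, assuming two simulated samples drawn from the same population (`sum_sq`, `small_sample`, and `large_sample` are hypothetical names for illustration):

```r
set.seed(2024)

# Two samples from the same simulated population, differing only in size
small_sample <- rnorm(20, mean = 30, sd = 1.5)
large_sample <- rnorm(2000, mean = 30, sd = 1.5)

# Helper: sum of squared differences from the sample mean
sum_sq <- function(x) sum((x - mean(x))^2)

# The larger sample gives a much larger sum, despite the same underlying spread
sum_sq(small_sample)
sum_sq(large_sample)
```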

So we divide by the sample size minus one, \(n - 1\), to standardise

\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \]

and obtain the variance
\(s^2\)

Try it in R

sum((sheep_data$GLl - mean(sheep_data$GLl))^2) / (length(sheep_data$GLl) - 1)

[1] 2.24968

var(sheep_data$GLl)

[1] 2.24968

Because we’re squaring the differences, the numbers can quickly get unruly

So we can take the square root of the variance

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \]

to get the standard deviation
\(s\)

Try it in R

sqrt(var(sheep_data$GLl))

[1] 1.499893

var(sheep_data$GLl)^(1/2)

[1] 1.499893

sd(sheep_data$GLl)

[1] 1.499893

Combined with the mean

standard deviation can say more about our data
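For instance, the mean and standard deviation together give a range in which many of the observations fall. A hedged sketch with simulated values (`x`, `lower`, and `upper` are stand-ins for illustration, not the sheep data):

```r
set.seed(11)
x <- rnorm(500, mean = 30, sd = 1.5)

# Range of one standard deviation on either side of the mean
lower <- mean(x) - sd(x)
upper <- mean(x) + sd(x)
c(lower = lower, upper = upper)

# Share of observations that fall inside that range
share_within <- mean(x >= lower & x <= upper)
share_within
```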

But for the whole sample this is not very informative

We could calculate summary statistics across multiple groups
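A minimal sketch of group-wise summaries in base R. The data frame `toy_data` and its `group` column are invented for illustration; the actual sheep data may be grouped by a different variable:

```r
set.seed(99)

# Hypothetical stand-in data: two invented groups, "A" and "B"
toy_data <- data.frame(
  group = rep(c("A", "B"), each = 30),
  GLl = c(rnorm(30, mean = 30, sd = 1.5), rnorm(30, mean = 31, sd = 1.5))
)

# Split the measurements by group, then compute mean, sd, and n for each
group_summary <- t(sapply(
  split(toy_data$GLl, toy_data$group),
  function(x) c(mean = mean(x), sd = sd(x), n = length(x))
))

group_summary
```

The result has one row per group and one column per statistic, which makes the groups easy to compare side by side.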

But how do we know if the groups differ (or not) in a meaningful way?

We could run some statistical test

but which one?

The correct answer: It depends…

Primarily, it depends on what your data look like

R for Data Science

Descriptive statistics only get you so far…

z3tt/TidyTuesday

We need to explore the data