One-sample & two-sample t-tests in Python
Sebastian Wright
I am new to statistics and Python and am unsure how to approach this problem. A student is trying to decide between two Processing Units. He wants to use the Processing Unit for his research to run high-performance algorithms, so the only thing he is concerned with is speed. He picks a high-performance algorithm and a large data set and runs it on both Processing Units 10 times, timing each run in hours. The results are given in the lists TestSample1 and TestSample2 below.
from scipy import stats
import numpy as nupy
TestSample1 = nupy.array([11,9,10,11,10,12,9,11,12,9])
TestSample2 = nupy.array([11,13,10,13,12,9,11,12,12,11])
Assumption: Both the dataset samples above are random, independent, parametric & normally distributed
Hint: You can import ttest function from scipy to perform t tests
Question 1 - One-sample t-test
Check if the mean of TestSample1 is equal to zero.
- Null Hypothesis is that mean is equal to zero.
- Alternate hypothesis is that it is not equal to zero.
Question 2
Given:
1. Null Hypothesis: There is no significant difference between the datasets.
2. Alternate Hypothesis: There is a significant difference.
Do two-sample testing and check whether to reject the Null Hypothesis or not.
Question 3 - Do two-sample testing and check whether there is a significant difference between the speeds of two samples: TestSample1 & TestSample3
He is trying a third Processing Unit - TestSample3.
TestSample3 = nupy.array([9,10,9,11,10,13,12,9,12,12])
Assumption: Both the datasets (TestSample1 & TestSample3) are random, independent, parametric & normally distributed
Answer
Question 1
The way to do this with SciPy would be this:
stats.ttest_1samp(TestSample1, popmean=0)
It is not a useful test to perform in this context though, because we already know that the null hypothesis must be false. Negative times are impossible, so the only way for the population mean of times to be zero would be if every time measured were always zero, which is clearly not the case.
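To make clear what the one-sample test computes, the statistic can be reproduced by hand as t = (sample mean - popmean) / (s / sqrt(n)), where s is the sample standard deviation with ddof=1. This is only a minimal sketch using the TestSample1 array from the question; the names t_manual, sample_mean and sample_sd are my own.

```python
import numpy as nupy
from scipy import stats

TestSample1 = nupy.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])

popmean = 0                          # hypothesised population mean from Question 1
n = TestSample1.size
sample_mean = TestSample1.mean()
sample_sd = TestSample1.std(ddof=1)  # ddof=1 gives the sample standard deviation

# One-sample t-statistic computed by hand
t_manual = (sample_mean - popmean) / (sample_sd / nupy.sqrt(n))

# The same test via SciPy
result = stats.ttest_1samp(TestSample1, popmean=popmean)

print(t_manual, result.statistic, result.pvalue)
```

The two statistics should agree up to floating-point error, and the p-value comes out tiny, as expected for times that are all around 10 hours.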
Question 2
Here's how to do a two-sample t-test for independent samples with SciPy:
stats.ttest_ind(TestSample1, TestSample2)
Output:
Ttest_indResult(statistic=-1.8325416653445783, pvalue=0.08346710398411555)
So the t-statistic is -1.8, but its deviation from zero is not formally significant (p = 0.08). This result is inconclusive. Of course it would be better to have more precise measurements, not rounded to hours.
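If you want the reject / do-not-reject decision spelled out in code, something like the following would do. It is a sketch that assumes the conventional 5% significance level; the question does not state one, so alpha is my assumption. It also runs Welch's variant (equal_var=False), which does not assume the two populations share a variance.

```python
import numpy as nupy
from scipy import stats

TestSample1 = nupy.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])
TestSample2 = nupy.array([11, 13, 10, 13, 12, 9, 11, 12, 12, 11])

alpha = 0.05  # assumed significance level, not given in the question

# Standard two-sample t-test for independent samples (pooled variance)
result = stats.ttest_ind(TestSample1, TestSample2)
reject_null = result.pvalue < alpha

# Welch's t-test: same idea, without the equal-variance assumption
welch = stats.ttest_ind(TestSample1, TestSample2, equal_var=False)

print("pooled:", result.statistic, result.pvalue, "reject H0:", reject_null)
print("Welch: ", welch.statistic, welch.pvalue)
```

With p = 0.08 above the assumed 0.05 threshold, reject_null is False, i.e. the null hypothesis is not rejected.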
In any case, I would argue that given your stated setting, you do not really need this test either. It is highly unlikely that two different CPUs perform exactly the same, and you just want to decide which one to go with. Simply choosing the one with the lower average time, regardless of significance test results, is clearly the right decision here.
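Picking the unit with the lower average time takes only a couple of lines. A minimal sketch using the two arrays from the question; the dictionary and the name fastest are just illustrative.

```python
import numpy as nupy

TestSample1 = nupy.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])
TestSample2 = nupy.array([11, 13, 10, 13, 12, 9, 11, 12, 12, 11])

# Average run time in hours for each Processing Unit
means = {
    "TestSample1": TestSample1.mean(),
    "TestSample2": TestSample2.mean(),
}

# Lower average time means the faster unit
fastest = min(means, key=means.get)
print(means, fastest)
```

Here the averages are 10.4 and 11.4 hours, so the first Processing Unit would be the one to go with.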
Question 3
This is analogous to Question 2.
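Concretely, the same call is used with TestSample3 in place of TestSample2. Again a sketch only, assuming the arrays from the question and a 5% significance level (alpha is my assumption, not part of the question).

```python
import numpy as nupy
from scipy import stats

TestSample1 = nupy.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])
TestSample3 = nupy.array([9, 10, 9, 11, 10, 13, 12, 9, 12, 12])

alpha = 0.05  # assumed significance level

# Two-sample t-test for independent samples, as in Question 2
result3 = stats.ttest_ind(TestSample1, TestSample3)
reject_null3 = result3.pvalue < alpha

print(result3.statistic, result3.pvalue, "reject H0:", reject_null3)
```

The sample means here are 10.4 and 10.7 hours, which is an even smaller gap than in Question 2, so the same caveats about reading anything into the significance test apply.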