Statistics is one of the most important pillars of data science. To succeed as a data scientist, you need to be familiar with its core concepts. Five important statistical concepts that you should know are as follows:
- P-Value: The p-value is the probability of getting a sample like ours, or one more extreme than ours, if the null hypothesis is true. A small p-value (typically below the chosen significance level, such as 0.05) leads you to reject the null hypothesis. If the p-value is large, we fail to reject the null hypothesis, which is not the same as proving it true. In short, the p-value tells us how likely a result at least this extreme would be if the null hypothesis were true.
- Central Limit Theorem: If you take many random samples from an underlying population and look at the frequency distribution of the sample averages, these sample averages will be approximately normally distributed even if the underlying population is not normal, provided each sample is reasonably large and the population has finite variance. In other words, the means of independent random samples tend towards a normal distribution as the sample size grows. The Central Limit Theorem is important for data science because it lies at the heart of one of the most important techniques in the field: hypothesis testing. The CLT is what justifies using normal-based methods, such as confidence intervals for a mean, even when the raw data is not normally distributed.
- Hypothesis Testing: Hypothesis testing is a statistical technique used to make decisions based on data. In it, we set up a null hypothesis and an alternative hypothesis. The null hypothesis states that there is no difference or effect in the data, and the alternative hypothesis states that there is one. We then use the sample data to decide whether there is enough evidence to reject the null hypothesis.
- Probability Distribution: A probability distribution is a description of all possible outcomes of a random variable and their associated probabilities. It tells you how likely each value, or range of values, of the variable is. For example, a fair six-sided die has a distribution that assigns probability 1/6 to each face.
- Confidence level: The confidence level is 1 – significance level, and it is used to express how confident you are in your conclusion. For example, if the null hypothesis is rejected at a 5% level of significance, the corresponding confidence level is 95%. A 95% confidence interval is a range of values built by a procedure that captures the true population mean in about 95% of repeated samples, which is often loosely described as being 95% confident that the range contains the true mean. This is not the same as a range that contains 95% of the values.
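
To make these ideas concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not a definitive recipe: the helper names (`sample_means`, `z_test_p_value`, `confidence_interval`), the exponential population, the null value of 1.5, and the random seed are all made up for this example, and the test uses a large-sample normal (z) approximation rather than a t-test.

```python
import random
import statistics
from statistics import NormalDist


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples (with replacement) and return each sample's mean."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


def z_test_p_value(sample, null_mean):
    """Two-sided p-value for H0: population mean == null_mean (z approximation)."""
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    z = (statistics.fmean(sample) - null_mean) / standard_error
    # Probability of a z-score at least this extreme, in either tail, under H0.
    return 2 * (1 - NormalDist().cdf(abs(z)))


def confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean (z approximation)."""
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z_crit * standard_error, mean + z_crit * standard_error


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption: a right-skewed, clearly non-normal population (exponential, mean 1).
population = [rng.expovariate(1.0) for _ in range(100_000)]

# Central Limit Theorem: means of many samples of size 50 cluster around the
# population mean, with spread close to sigma / sqrt(50).
means = sample_means(population, sample_size=50, n_samples=2_000, rng=rng)
print("population mean:     ", round(statistics.fmean(population), 3))
print("mean of sample means:", round(statistics.fmean(means), 3))
print("std of sample means: ", round(statistics.stdev(means), 3))

# Hypothesis test and p-value: H0 says the population mean is 1.5 (hypothetical).
sample = rng.choices(population, k=200)
alpha = 0.05  # significance level; confidence level is 1 - alpha
p_value = z_test_p_value(sample, null_mean=1.5)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print("p-value:", p_value, "->", decision)

# Confidence interval for the mean at confidence level 1 - alpha.
low, high = confidence_interval(sample, confidence=1 - alpha)
print("95% confidence interval for the mean:", (round(low, 3), round(high, 3)))
```

Plotting a histogram of `means` would show the familiar bell shape even though the population itself is heavily skewed. For small samples, a t-distribution would be the more appropriate choice than the normal approximation used here.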