Statistics 101

Joe Herald
4 min read · Jul 3, 2021

Overview of fundamental concepts of statistics.

Statistics is a fundamental skill in most professions, especially in academia, where collecting, processing, and analyzing data are core competencies. When scientists conduct an experiment, they use statistics to extract valuable information from the data. We now live in the era of big data and a data-driven society, so the demand for statistical skills is likely to remain high.

Let’s go over some of the basics of statistics.

Photo by Carlos Muza on Unsplash

Statistics is the art of dealing with data, and data is everywhere. When we conduct an analysis, the first step is to collect data. The full set of data we are interested in is called the population. A population can contain millions or billions of data points, so collecting and analyzing all of it is often not feasible; it could take a month or more just to gather and process it. Instead, we work with a sample of the population.

A sample is a specific group drawn from the population, whereas sampling refers to the method used to acquire the sample. The goal is to collect a small but unbiased subset of the population that is sufficient for the analysis. For example, in a survey of job satisfaction among the people of one city, it is impractical to ask everyone to take part. Choosing a smaller group of workers across professions, backgrounds, genders, ages, and so on saves time and, if the group is chosen well, can still give answers that reflect the whole city accurately.
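The simplest way to pick such a group is a simple random sample, where every member of the population has the same chance of being chosen. Below is a minimal sketch using Python's standard library. The worker IDs, the population size, and the helper name `simple_random_sample` are assumptions made up for illustration.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members, each equally likely to be picked."""
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    return rng.sample(population, sample_size)

# Hypothetical population: ID numbers for 10,000 workers in one city.
workers = list(range(10_000))

# Ask only 200 of them to take the survey.
survey_group = simple_random_sample(workers, sample_size=200, seed=42)
print(len(survey_group))  # 200
```

Because the draw is made without replacement, nobody is asked to take the survey twice.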

Be aware of sampling bias: if the collected data does not represent the behavior of the overall population, the results can be badly misleading. Sampling bias can arise from convenience sampling, or from under-representing or over-representing some groups (we will hopefully talk more about this in a future post). These biases limit how far the findings can be generalized, which can be disastrous for the whole analysis.
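To see how a convenience sample can go wrong, here is a small sketch in Python. The population is invented for illustration: 1,000 satisfaction scores, deliberately sorted so that the people who are easiest to reach (the start of the list) are also the least satisfied. Real data will rarely be this extreme.

```python
import random
import statistics

# Hypothetical population: satisfaction scores from 1 to 10 for 1,000 workers,
# sorted so that the least satisfied workers come first in the list.
population = sorted(i % 10 + 1 for i in range(1000))

# Convenience sample: just take the first 50 people on the list.
convenience_sample = population[:50]

# Simple random sample: every worker has the same chance of being picked.
rng = random.Random(0)
random_sample = rng.sample(population, 50)

population_mean = statistics.mean(population)
convenience_error = abs(statistics.mean(convenience_sample) - population_mean)
random_error = abs(statistics.mean(random_sample) - population_mean)

print(f"population mean:      {population_mean}")
print(f"convenience off by:   {convenience_error}")
print(f"random sample off by: {random_error:.2f}")
```

In this setup the convenience sample contains only the lowest score, so its mean misses the population mean by a wide margin, while the random sample lands much closer.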

Photo by Campaign Creators on Unsplash

Good sampling collects enough data to represent the population. One good sampling method is stratified random sampling, in which the population is divided into a few subpopulations (strata) based on factors such as time, features, or environment, and a random sample is drawn from each. This technique helps reduce sampling bias. For instance, in a manufacturing setting, a stratified random sample of a product should be drawn randomly from different time frames, machines, ingredient suppliers, or operators.
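As a rough sketch, stratified random sampling can be written in a few lines of Python. The production log, the machine names, the 10% fraction, and the helper `stratified_sample` are all hypothetical. Taking the same fraction from each stratum (proportional allocation) is one common choice among several.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(items, stratum_of, fraction, seed=None):
    """Randomly sample the same fraction from every stratum (subgroup)."""
    rng = random.Random(seed)

    # Group the items by the stratum they belong to.
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = []
    for members in strata.values():
        # Take at least one item so that no stratum is left out entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical production log: (unit_id, machine) pairs from three machines.
machines = ["A"] * 600 + ["B"] * 300 + ["C"] * 100
units = list(enumerate(machines))

picked = stratified_sample(units, stratum_of=lambda unit: unit[1], fraction=0.1, seed=1)
counts = Counter(machine for _, machine in picked)
print(counts)  # Counter({'A': 60, 'B': 30, 'C': 10})
```

Every machine shows up in the sample in proportion to its share of production, which a plain random draw does not guarantee for small strata such as machine C.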

After collecting the data, we run some basic analysis to understand it. The two main types of measures are measures of center and measures of spread. Both give insight into the distribution of the data.

Measures of center are critical because they show where the data is concentrated. Examples are the mean and the median. The mean, or x-bar, is the average value of the sample: divide the sum of all data values by the number of values. The median is the middle value of the distribution: sort the sample and take the value at the middle position (or the average of the two middle values when the number of values is even).
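Here is a short Python sketch of both measures. The delivery times are made-up numbers, chosen to include one unusually large value so the difference between the mean and the median is easy to see.

```python
import statistics

# Hypothetical sample: delivery times in days, with one unusually slow order.
delivery_days = [2, 4, 4, 5, 7, 9, 40]

# Mean (x-bar): the sum of all values divided by the number of values.
mean_days = sum(delivery_days) / len(delivery_days)

# Median: the middle value once the data is sorted.
median_days = statistics.median(delivery_days)

print(round(mean_days, 2))  # 10.14
print(median_days)          # 5
```

The single slow order pulls the mean up to about 10 days, while the median stays at 5, so looking at both gives a better picture than either one alone.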

Photo by Luke Chesser on Unsplash

Combined with measures of spread, the measures of center give a much fuller picture of the distribution. Spread measures such as the standard deviation, range, maximum, and minimum are also crucial to understanding data. The standard deviation, also known as sigma, shows how far the values typically deviate from the mean, which indicates how consistent the data is: a low standard deviation means the data is consistent. Variance, the square of the standard deviation, is another spread measure. In quality control, the standard deviation plays an important role. For normally distributed data, about 68% of values fall within one sigma of the mean, about 95% within two sigma, and about 99.7% within three sigma. Three-sigma is commonly used to describe processes that operate efficiently and produce high-quality products. The range, the difference between the maximum and minimum values, is another spread measure; it shows how widely the data is spread and makes it easy to spot possible outliers.
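The sketch below computes these spread measures in Python and then checks the one, two, and three sigma percentages on simulated data. The measurements are invented so the results come out as round numbers, and the simulation assumes the data is normally distributed, which real process data may not be. `fraction_within` is a hypothetical helper, not a library function.

```python
import random
import statistics

# Hypothetical measurements, chosen so the numbers come out round.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = statistics.pstdev(values)        # population standard deviation: 2.0
variance = statistics.pvariance(values)  # square of sigma: 4
value_range = max(values) - min(values)  # 9 - 2 = 7

print(sigma, variance, value_range)

def fraction_within(data, k):
    """Share of the data lying within k standard deviations of the mean."""
    mu = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mu) <= k * sd for x in data) / len(data)

# Simulate 20,000 draws from a normal distribution (mean 0, sigma 1).
rng = random.Random(0)
simulated = [rng.gauss(0, 1) for _ in range(20_000)]

for k in (1, 2, 3):
    # Expect values close to 0.68, 0.95 and 0.997.
    print(k, round(fraction_within(simulated, k), 3))
```

`pstdev` and `pvariance` treat the data as the whole population. When the data is only a sample, `statistics.stdev` and `statistics.variance` divide by n - 1 instead and give slightly larger values.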

Furthermore, these results can be presented numerically or graphically. Graphical presentations often provide more insight into the data and help analysts discover more from it.
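A histogram is one common graphical presentation. As a rough illustration that needs no plotting library, the Python sketch below prints a text histogram. The test scores and the helper `text_histogram` are hypothetical, and a real analysis would more likely use a charting tool.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Return one line per non-empty bin, with one '#' per value in the bin."""
    # Map each value to the lower edge of its bin, then count per bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        label = f"{start:>3}-{start + bin_width - 1:<3}"
        lines.append(f"{label} {'#' * counts[start]}")
    return lines

# Hypothetical sample of integer test scores.
scores = [52, 55, 61, 64, 66, 67, 70, 71, 73, 74, 75, 78, 82, 85, 98]

for line in text_histogram(scores, bin_width=10):
    print(line)
```

Even in this crude form, the shape of the data is visible at a glance: most scores sit in the 70s, and the single score in the 90s stands apart from the rest.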

In short, statistics is fun, yet vast. Embracing statistics is a must, not only in manufacturing but in all kinds of industries, to achieve high productivity and efficiency. The future for statisticians has never been as bright and promising as it is today.

Joe Herald
Science & technology enthusiast, providing insight into real-life problems.