In quality and manufacturing, statistical tests provide objective evidence for decision-making. They help identify variations in processes and distinguish random fluctuations from actual problems. In engineering, statistics help identify patterns, outliers, and sources of failure in system performance, ensuring data-driven decision-making. By rigorously analyzing experimental results, engineers can validate product designs and manufacturing processes, detecting potential problems before implementation. This systematic approach reduces the risk of unexpected failures and enhances overall safety by ensuring reliability and compliance with international safety standards.
This post reviews the main statistical tests used in manufacturing and Total Quality Management (TQM).
Note: although they also concern engineering, research and science, the following two statistical analyses
- correlation analysis: measures the strength and direction of the relationship between two variables (e.g., Pearson correlation coefficient).
- regression analysis: examines the relationship between variables (e.g., input factors and process output), from simple linear to multiple regression.
are not covered here; they are treated in a separate article about the 10 main algorithms for engineering.
Normality Tests
In the world of statistical tests, many common methods (t-tests, ANOVA, linear regression, etc.) assume that the data are normally (Gaussian) distributed, or that the residuals/errors are normal. Violating this assumption can make the results unreliable: p-values can be misleading, confidence intervals may be wrong, and the risk of Type I/II errors increases. Note that some tests, such as the one-way ANOVA, handle moderately non-normal distributions reasonably well.
Note: if your data are not normal (see the real-life cases below), you may need to use non-parametric tests (such as the Mann-Whitney U test or Kruskal-Wallis test), which do not assume normality, or transform your data; both approaches are outside the scope of this post.
While several statistical tests exist for this, we detail here the Shapiro-Wilk test, known especially for small sample sizes (typically n < 50) but usable up to about n = 2000.
FYI, other common normality tests:
- Kolmogorov-Smirnov (K-S) test (with Lilliefors correction): works better with larger sample sizes, but is less sensitive than Shapiro-Wilk, especially for small datasets.
- Anderson-Darling test: works with all sample sizes and is more sensitive to the tails (extremes) of the distribution, making it more powerful for detecting departures from normality in the extremes.
How-to perform the Shapiro-Wilk normality test
1. Calculate or compute the Shapiro-Wilk test statistic (W):
W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
where:
- x_{(i)} = the i-th order statistic (i.e., the i-th smallest value in the sample)
- x_i = the i-th observed value
- \bar{x} = the sample mean
- a_i = constants (weights) calculated from the means, variances, and covariances of the order statistics of a sample from a standard normal distribution (N(0,1)); they depend only on the sample size n
- n = sample size
The numerator is the squared sum of the weighted ordered sample values; the denominator is the sum of the squared deviations from the sample mean (i.e., the sample variance scaled by n-1).
Note: the calculation of the a_i coefficients is nontrivial and generally requires a table or algorithm, which is why the Shapiro-Wilk test is nearly always computed by software such as R, Python's SciPy, MS Excel add-ons or other dedicated software. For a manual calculation, this page provides all the a_i coefficients and p-values for samples up to n = 50.
The value of W ranges between 0 and 1 (W = 1: perfect normality; the further W is from 1, the less normal your data are).
2. W alone is not enough. It works in conjunction with its corresponding p-value to give the confidence level. In the Shapiro-Wilk table, at the row for your sample size n, look for the closest value to your calculated W and read its corresponding p-value at the top.
3. Result: if the p-value is greater than the chosen alpha level (e.g., 0.05), there is no statistical evidence against normality, and the data can be treated as normally distributed.
For normality testing, it is frequently advised to combine a numerical method with a graphical method such as Henry's line, Q-Q plots or histograms:
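As an illustration, here is a minimal Python sketch combining the Shapiro-Wilk test with a Q-Q plot, using SciPy and Matplotlib; the diameter data are made up for the example.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical sample: 30 diameter measurements from a process (illustrative values)
rng = np.random.default_rng(42)
diameters = rng.normal(loc=10.0, scale=0.05, size=30)

# Numerical check: Shapiro-Wilk test
w_stat, p_value = stats.shapiro(diameters)
print(f"W = {w_stat:.4f}, p-value = {p_value:.4f}")
if p_value > 0.05:
    print("No evidence against normality at alpha = 0.05")
else:
    print("Data likely not normally distributed")

# Graphical check: Q-Q plot against a normal distribution
stats.probplot(diameters, dist="norm", plot=plt)
plt.title("Q-Q plot of diameter measurements")
plt.show()
```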
Mind Non-normal Distributions!
While the normal/Gaussian distribution is the most frequent case, it should not be automatically assumed. Everyday counter-examples include:
- Wealth and income distribution among individuals. It follows a Pareto (power law) distribution, skewed with a “long tail” of very wealthy individuals.
- City population sizes in a country follow Zipf’s Law (power law), with a few very large cities and many small towns.
- Earthquake magnitudes and frequency are a power law/Gutenberg-Richter distribution: small earthquakes are common, large ones are rare.
- Daily price changes or returns in financial markets: fat-tailed/heavy-tailed distributions, not Gaussian; large deviations occur more frequently than predicted by a normal distribution.
- Word frequencies in language: like the city populations above, they follow Zipf's Law (power law): few words are used often, most words are rare.
- Internet traffic/website popularity: power law/long tail: Some sites have millions of hits, most have very few.
- File sizes on computer systems: log-normal or power law, with a few very large files and many small ones.
- Human lifespans/longevity: skewed, not normal (often modeled with Weibull or Gompertz distributions); deaths cluster at older ages, with a long tail toward younger ages.
- Social network connections follow a power law: few users have many connections; most have few.
Most of these are characterized by “few large, many small”, a signature of power laws, heavy tails, exponential or log-normal distributions, and not the symmetrical shape of the Gaussian.
The t-Test (Student’s t-Test)
The t-Test (aka "Student's t"), developed by William Sealy Gosset under the pseudonym "Student" in 1908, is a statistical test used to compare means when sample sizes are small and the population variance is unknown. Focused on comparing the means of two populations, it is one of the most used tests in manufacturing.
Purpose: the t-Test helps engineers and quality professionals determine if there is a statistically significant difference between the means of two groups or between a sample mean and a known standard. It’s commonly used in hypothesis testing to evaluate whether process changes or product modifications have led to real improvements or differences, beyond what could be expected by chance.
Practical examples in the industry:
- In automotive manufacturing, a t-Test might be used to compare the tensile strength of steel from two different suppliers to ensure consistent quality.
- In pharmaceuticals, the t-Test is used to analyze whether a new production process yields tablets with a mean weight significantly different from the standard.
- In electronics, engineers may use the t-Test to verify if a design change in a circuit board results in a measurable improvement in electrical resistance.
How-to the Student’s t-Test
There are many variants of the t-test; the example here focuses on the so-called "two-sample t-test" in its "unpaired" version, comparing samples from 2 different production batches.
- State your null and alternative hypotheses; in this example, "there is no difference between the means" vs "the means are different".
- Collect your data from the 2 production batches being compared and calculate
- the 2 sample means \bar{X} = \frac{1}{n_1} \sum_{i=1}^{n_1} X_i and \bar{Y} = \frac{1}{n_2} \sum_{j=1}^{n_2} Y_j
- Calculate the 2 sample variances: S_X^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1} (X_i - \bar{X})^2 and S_Y^2 = \frac{1}{n_2-1} \sum_{j=1}^{n_2} (Y_j - \bar{Y})^2
- the 2 sample sizes n_1 and n_2.
- Calculate the test statistic. While the method assumes that both samples are independent and drawn from normally distributed populations, there are still two cases:
- if equal variances are assumed ("pooled" t-test):
Pooled variance: S_p^2 = \frac{ (n_1-1)S_X^2 + (n_2-1)S_Y^2 }{ n_1 + n_2 - 2 }
Test statistic: t = \frac{ \bar{X} - \bar{Y} }{ S_p \sqrt{ \frac{1}{n_1} + \frac{1}{n_2} } }
- if unequal variances (Welch's t-test):
Test statistic: t = \frac{ \bar{X} - \bar{Y} }{ \sqrt{ \frac{S_X^2}{n_1} + \frac{S_Y^2}{n_2} } }
Degrees of freedom (approximate, Welch-Satterthwaite): df = \frac{\left( \frac{S_X^2}{n_1} + \frac{S_Y^2}{n_2} \right)^2}{ \frac{ (S_X^2 / n_1)^2 }{ n_1 - 1 } + \frac{ (S_Y^2 / n_2)^2 }{ n_2 - 1 } }
- Use the calculated t and the degrees of freedom (n_1+n_2-2 for equal variances, or the Welch formula) to look up or compute the p-value from the t-distribution (depending on whether it's a one-tailed or two-tailed test).
- Result: compare the calculated t-value with the critical t-value from statistical tables based on your chosen confidence level and degrees of freedom; alternatively, use software for the p-value. If the t-statistic exceeds the critical value or the p-value is below your threshold (typically 0.05), reject the null hypothesis.
Link to the t-Test critical values table
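For software users, here is a minimal Python sketch of the unpaired two-sample t-test with SciPy, in both the pooled and Welch variants; the tensile-strength values for the two batches are hypothetical.

```python
from scipy import stats

# Hypothetical tensile-strength measurements (MPa) from 2 production batches
batch_a = [512, 508, 515, 510, 507, 511, 509, 514]
batch_b = [505, 503, 509, 506, 504, 508, 502, 507]

# Pooled t-test (assumes equal variances)
t_pooled, p_pooled = stats.ttest_ind(batch_a, batch_b, equal_var=True)

# Welch's t-test (does not assume equal variances)
t_welch, p_welch = stats.ttest_ind(batch_a, batch_b, equal_var=False)

print(f"Pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"Welch : t = {t_welch:.3f}, p = {p_welch:.4f}")

# Decision at alpha = 0.05 (two-tailed)
alpha = 0.05
if p_welch < alpha:
    print("Reject H0: the batch means differ significantly")
else:
    print("Fail to reject H0: no significant difference between the means")
```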
The F-Test
The F-test, introduced by statistician Ronald A. Fisher in the early 20th century, is used to compare the variability (variance) between two sets of data, to assess whether their population variances are significantly different. In quality and engineering, it often helps determine whether process changes or different machines produce consistent results, or whether new methods affect product variability. It is often a preliminary step before applying t-tests and ANOVA to larger comparisons.
Purpose: the F-Test is used to confirm if two processes or samples have the same level of variation, which supports quality control decisions and process improvements. It helps engineers identify if changes (e.g., new machines, suppliers, or materials) impact the consistency or quality of a product.
Industry Examples
- Manufacturing: comparing the dimensional variances of parts produced by two different machines to ensure both machines produce consistently within quality standards.
- Supplier Evaluation: comparing the strength variability of raw materials from two different suppliers to decide if one supplier provides more consistent quality.
- Quality Improvement: testing if a process improvement (like a new calibration method) has reduced the variability in final product weight compared to the old method.
How-to the F-Test
- Collect two sets of sample data (e.g., measurements from process A and process B).
- Calculate the variance for each sample group A and B.
- Divide the larger variance by the smaller variance to get the F-value.
- Result: compare this F-value to a critical value from the F-distribution table based on the sample sizes and the desired confidence level; if the calculated F-value is greater, the variances are significantly different. In variance-ratio tests such as this one, the degrees of freedom (DOF) associated with each group are the number of samples minus one (note that the DOF are computed differently for ANOVA).
F-distribution table: link to the F-distribution table up to 15×15 DOF (and online F critical calculator for bigger DOF)
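A minimal Python sketch of this variance-ratio F-test: SciPy does not provide a dedicated two-sample variance F-test function, so the ratio is computed directly and the p-value taken from SciPy's F-distribution. The machine measurements are made-up example data.

```python
import numpy as np
from scipy import stats

# Hypothetical dimensional measurements (mm) from two machines
machine_a = np.array([25.01, 25.03, 24.98, 25.02, 25.00, 25.04, 24.99, 25.02])
machine_b = np.array([25.05, 24.95, 25.08, 24.92, 25.06, 24.94, 25.07, 24.96])

var_a = machine_a.var(ddof=1)  # sample variance of machine A
var_b = machine_b.var(ddof=1)  # sample variance of machine B

# Put the larger variance in the numerator, as in the how-to above
if var_a >= var_b:
    f_value = var_a / var_b
    df_num, df_den = len(machine_a) - 1, len(machine_b) - 1
else:
    f_value = var_b / var_a
    df_num, df_den = len(machine_b) - 1, len(machine_a) - 1

# Two-tailed p-value from the upper tail of the F-distribution
p_value = min(2 * stats.f.sf(f_value, df_num, df_den), 1.0)

print(f"F = {f_value:.3f}, DOF = ({df_num}, {df_den}), p = {p_value:.4f}")
if p_value < 0.05:
    print("Variances differ significantly at alpha = 0.05")
else:
    print("No significant difference between the variances")
```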
Analysis of Variance (ANOVA)
While the F-test above refers broadly to any statistical test that uses the F-distribution and is used to compare variances or ratios of variances between two or more groups, the ANOVA is a variant that compares the means of three or more groups to see if at least one is significantly different. The ANOVA test was also developed by Ronald Fisher in the 1920s as a statistical tool for agricultural experiments.
Purpose: the Analysis of Variance (ANOVA) determines whether there are statistically significant differences between the means of three or more independent groups. In quality, engineering and particularly in Design of Experiments (DOE), it helps identify which factors or processes have a significant impact on product performance or output, aiding robust decision-making and process improvement.
Examples:
- In pharmaceutical production, ANOVA can help compare the effects of different formulation processes on the efficacy of a drug.
- In electronics, it is used to test if the variance in circuit board failure rates is due to different batches of raw materials.
How-to ANOVA in Brief
1. Define the groups or treatments you want to compare and collect data from each group. Calculate the Sum of Squares Between groups (SSB) and the Sum of Squares Within groups (SSW).
2. Use these values to calculate the F-statistic (see below), which is the ratio of the variance between groups to the variance within groups.
3. Compare the F-statistic to a critical value from the F-distribution table at a chosen significance level (like 0.05).
4. Result: if the F-statistic exceeds the critical value, you conclude that there are significant differences among the group means.
The F-statistic: F corresponds to the Mean Square Between Groups (MSB) divided by the Mean Square Within Groups (MSW). Practically:
F = \frac{ \frac{SSB}{k-1} }{ \frac{SSW}{N-k} }
where SSB = Sum of Squares Between Groups, SSW = Sum of Squares Within Groups, k = number of groups and N = total number of observations.
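A minimal Python sketch of a one-way ANOVA with SciPy's f_oneway, which computes the F-statistic above from the group data; the efficacy values for the three formulation processes are hypothetical.

```python
from scipy import stats

# Hypothetical efficacy measurements (%) from three formulation processes
process_1 = [88.1, 89.4, 87.9, 88.6, 89.0]
process_2 = [90.2, 91.1, 89.8, 90.5, 90.9]
process_3 = [88.5, 88.9, 89.2, 88.3, 88.8]

# One-way ANOVA: F = MSB / MSW, computed internally from SSB and SSW
f_stat, p_value = stats.f_oneway(process_1, process_2, process_3)

print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("At least one process mean differs significantly (alpha = 0.05)")
else:
    print("No significant difference among the process means")
```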
The Chi-Square Test
The Chi-Square Test, introduced by Karl Pearson in 1900, revolutionized statistical hypothesis testing by providing a method to determine if there is a significant difference between expected and observed frequencies in categorical data. In quality and engineering, it helps assess whether deviations in a process or product attributes occur by chance or suggest a systemic issue.
Purpose: the Chi-Square Test checks if the differences between observed and expected results in quality measurements are due to random variation or indicate a specific problem that needs addressing.
Practical examples in industry
- Manufacturing Defects: checking if the distribution of defective products across different shifts or machines is uniform, and whether certain shifts have a significantly higher defect rate.
- Supplier Quality: comparing the quality performance (e.g., pass/fail rates) of components from multiple suppliers to determine if one supplier’s parts are statistically more likely to fail.
- Customer Complaints: analyzing whether the types or frequency of customer complaints are randomly distributed throughout the year, or are associated with specific times, products, or regions.
How to do the chi-square test
- Collect observed data and determine the expected frequencies for each category under the null hypothesis.
- Use the Chi-Square formula: \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}, where O_i is the observed frequency and E_i the expected frequency for category i.
- Compare the calculated Chi-Square value against a critical value from the Chi-Square table with the appropriate degrees of freedom.
- Result: if the value exceeds the table value, conclude there is a statistically significant difference.
Link to the chi-square critical values table
Chi-Square Full Example: Fairness of a Die
| i | Oi | Ei | Oi − Ei | (Oi − Ei)² |
|---|----|----|---------|------------|
| 1 | 5  | 10 | −5      | 25         |
| 2 | 8  | 10 | −2      | 4          |
| 3 | 9  | 10 | −1      | 1          |
| 4 | 8  | 10 | −2      | 4          |
| 5 | 10 | 10 | 0       | 0          |
| 6 | 20 | 10 | 10      | 100        |
| Sum |  |   |         | 134        |
This full example is taken from the Wikipedia article on Pearson's chi-squared test.
Experiment: a 6-sided die is thrown 60 times. The number of times it lands face up on 1, 2, 3, 4, 5, 6 is 5, 8, 9, 8, 10 and 20, respectively.
Question: is the die biased, according to Pearson's chi-squared test, at the 95% and/or 99% confidence level?
The null hypothesis is that the die is unbiased, hence each number is expected to occur the same number of times; here, 60/6 = 10.
The outcomes can be tabulated as in the table above.
Upper-tail critical values of the chi-square distribution (probability less than the critical value):

| Degrees of freedom | 0.90 | 0.95 | 0.975 | 0.99 | 0.999 |
|---|---|---|---|---|---|
| 5 | 9.236 | 11.070 | 12.833 | 15.086 | 20.515 |
Looking at an upper-tail critical values table of the chi-square distribution (linked in the how-to above), the value to compare with the table is the sum of the squared deviations, each divided by the expected count.
For the present example, this gives: \chi^2 = \frac{25}{10} + \frac{4}{10} + \frac{1}{10} + \frac{4}{10} + \frac{0}{10} + \frac{100}{10} = 13.4
Conclusion of the test: 13.4 is the experimental result whose unlikeliness (with a fair die) we wish to estimate. With 5 degrees of freedom, 13.4 lies between the critical values 12.833 (97.5%) and 15.086 (99%): the die can be declared biased at the 95% (and even 97.5%) confidence level, but not at the 99% level.
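The same die example as a minimal Python sketch using SciPy's chisquare, with the observed counts from the table above:

```python
from scipy import stats

# Observed counts for faces 1..6 after 60 throws (from the example above)
observed = [5, 8, 9, 8, 10, 20]
expected = [10, 10, 10, 10, 10, 10]  # fair die: 60 / 6 = 10 per face

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"chi2 = {chi2:.1f}, p = {p_value:.4f}")  # chi2 = 13.4
if p_value < 0.05:
    print("Reject H0 at alpha = 0.05: the die appears biased")
else:
    print("Fail to reject H0: no evidence of bias")
```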
Process Capability (Cp, Cpk, Pp, Ppk)
Not statistical tests per se, these 4 ratios assess how well a process meets specifications, making them a critical tool for maintaining and improving quality standards in manufacturing.
Process capability analysis originated in the early 20th century alongside the rise of statistical quality control in manufacturing, pioneered by figures like Walter Shewhart. Its methods evolved through the growth of Six Sigma and Total Quality Management (TQM) in the late 20th century as a cornerstone of modern quality engineering.
Purpose: process capability analysis assesses how well a process can produce output within specified limits (tolerances). It quantifies the variability of a process relative to design specifications and determines the likelihood of producing defective products. The analysis helps identify opportunities for process improvement and ensures products consistently meet customer requirements.
Cp, Cpk and Statistical Tests in Industry
- Automotive manufacturing: statistical tests and these 4 ratios are used to check whether the diameter of engine pistons remains consistently within tight tolerance limits, ensuring compatibility and reducing engine failures.
- Pharmaceutical industry: applied to verify that the fill weight of tablets or capsules consistently meets regulatory and quality standards, minimizing underdose or overdose risks.
- Semiconductor manufacturing: employed to monitor the thickness of wafer coatings, ensuring reliability and performance in microchip production.
How to calculate Cp, Cpk, Pp and Ppk
Cp: Process Capability
Cp = \frac{USL - LSL}{6\sigma}
where USL = Upper Specification Limit, LSL = Lower Specification Limit, and \sigma = standard deviation (typically estimated from within-subgroup variation).
Cpk: Process Capability Index
Cpk = \min\left(\frac{USL - \mu}{3\sigma}, \frac{\mu - LSL}{3\sigma}\right)
where \mu = process mean.
Pp: Process Performance
Pp = \frac{USL - LSL}{6s}
where s = overall standard deviation (includes both within- and between-subgroup variation; used over a longer period).
Ppk: Process Performance Index
Ppk = \min\left(\frac{USL - \bar{x}}{3s}, \frac{\bar{x} - LSL}{3s}\right)
where \bar{x} = overall mean.
How to Conclude with Cp, Cpk, Pp, Ppk Values
- Cp, Pp: if > 1, the process has the potential to meet specifications; values ≥ 1.33 are generally considered capable, depending on your industry and the criticality of your exact application.
- Cpk, Ppk: these reflect how centered the process is within specs; the closer Cpk/Ppk are to Cp/Pp, the more centered the process.
- If Cpk or Ppk <1, a significant portion of output is likely outside the specification; process improvement is needed.
- A higher index indicates a more capable (and usually, higher quality) process.
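A minimal Python sketch computing the four indices from the formulas above; the piston-diameter data and specification limits are made up, and the within-subgroup sigma is estimated here as the pooled subgroup standard deviation (one common convention among several, such as R-bar/d2).

```python
import numpy as np

# Hypothetical piston diameter data (mm): 5 subgroups of 5 consecutive parts
subgroups = np.array([
    [24.98, 25.01, 25.00, 25.02, 24.99],
    [25.03, 25.00, 25.01, 24.98, 25.02],
    [24.99, 25.00, 25.02, 25.01, 25.00],
    [25.01, 24.97, 25.00, 25.02, 25.01],
    [25.00, 25.03, 24.99, 25.01, 25.00],
])
USL, LSL = 25.10, 24.90  # assumed specification limits

data = subgroups.ravel()
mean = data.mean()

# Within-subgroup standard deviation (pooled), used for Cp / Cpk
sigma_within = np.sqrt(np.mean(subgroups.var(axis=1, ddof=1)))

# Overall standard deviation, used for Pp / Ppk
s_overall = data.std(ddof=1)

Cp  = (USL - LSL) / (6 * sigma_within)
Cpk = min((USL - mean) / (3 * sigma_within), (mean - LSL) / (3 * sigma_within))
Pp  = (USL - LSL) / (6 * s_overall)
Ppk = min((USL - mean) / (3 * s_overall), (mean - LSL) / (3 * s_overall))

print(f"Cp = {Cp:.2f}, Cpk = {Cpk:.2f}, Pp = {Pp:.2f}, Ppk = {Ppk:.2f}")
```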
Conclusion & Pitfalls
Statistical tests are powerful tools in data analysis, but their use demands both strong theoretical understanding and critical real-world judgment and adaptation, far more than just installing statistical software or following QMS rules.
- Understanding assumptions & selecting the right test: every statistical test has a set of underlying assumptions (e.g., normality of data, equal variances, independence of observations). If these assumptions are violated or an inappropriate test is chosen, the results of the test may be invalid or misleading.
- Real-world messiness & business context matters: industrial data often violate test assumptions (e.g., non-normality, autocorrelation). Blindly applying textbook tests can result in completely misleading analyses.
- Data quality issues: measurement errors, outliers, and missing data are common in industrial statistical tests and must be addressed and documented before testing.
For product design as well as for quality, put your effort where it is needed: "Sometimes, results are statistically significant but have negligible practical impact, or vice versa."