Correlation analysis includes the "product-moment correlation," which deals with regular numbers such as 1, 2, 3, ..., and the "nonparametric correlations," which deal with data expressed in non-numerical categories such as "yes or no," "large, medium and small," and so on. Let us first look at the nonparametric ones that are most often used in cross-cultural theory tests. The first is the correlation expressed in a 2 × 2 "contingency" table.
In this cross-tabulation of two rows and two columns with four cells, Cells ( a ) and ( d ) agree with the presumption that weight is influenced by stature, so they are "hits." But ( b ) and ( c ) do not agree with that view, so they are "misses." Now, to say that there is a correlation between stature and weight is to say that "hits" overwhelm "misses." So, one counts the cases of hits and of misses and compares them to know whether or not there is a correlation.
George Udny Yule ( 1871-1951 ), a Scottish statistician, invented a method for this comparison that has since been widely used, under the name of "Yule's Q," for the correlation in a 2 × 2 contingency table (*1):
Q = ( ad − bc ) / ( ad + bc )
Namely, when there are only hits with no misses, the numerator and denominator are the same, resulting in Q = 1. When there are only misses, Q = −1. And when hits and misses are equal, Q = 0, which fits the common definition of correlation coefficients as ranging between +1 and −1. Now, the data from the 79 Yokohama students produce the following correlation by Yule's Q:
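The calculation can be sketched in Python. This is only an illustration, assuming the usual definition Q = ( ad − bc ) / ( ad + bc ) and the cell counts a = 28, b = 11, c = 13, d = 27 that appear in the Chi Square computation later in this section; the function name `yules_q` is hypothetical.

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2 x 2 table whose hit cells are a, d and miss cells are b, c."""
    hits = a * d
    misses = b * c
    return (hits - misses) / (hits + misses)


# Cell counts of the 79 Yokohama students (taken from the Chi Square example).
q = yules_q(28, 11, 13, 27)
print(round(q, 3))  # 0.682
```

With only hits the function returns 1, and with only misses it returns −1, as described above.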
The chance risk for this Yule's Q can be obtained by the procedure called the "Chi Square Test" ( χ^{ 2} ).
A Chi Square is a score that indicates the amount of relatedness of the two variables ( stature and weight in this case ). If there is no relationship at all, this score is zero. As the relatedness gets larger, so does the Chi Square.
As we have already seen, a strong correlation does not often occur by chance alone ( unless the cases are too few ). And as the relatedness grows smaller and smaller, the chance risk grows larger and larger. Hence, the Chi Square score varies inversely with the chance risk "p": the larger the score, the smaller the risk. This interrelationship has already been calculated mathematically, and the resulting table is attached to most statistical textbooks and to the end of this chapter, too. Here is how to calculate the Chi Square.
First, let us suppose that "there is absolutely no relationship at all between the two factors in question," the null hypothesis.
In the data of stature and weight for the 79 Yokohama students, the total number ( N ) is 79 and the heavy students among them number 39 ( E ), or 49.4 percent of the total. Now, if the above null hypothesis is true, heavy students are expected to be found in the same proportion in the taller subgroup of 41 students as in the shorter subgroup of 38.
That is to say, the heavy students in the taller subgroup are expected to number 41 multiplied by .494, or 20.2, and those in the shorter subgroup 38 multiplied by .494, or 18.8. In other words, 20.2 is the expected frequency for Cell ( a ) and 18.8 for Cell ( b ). However, the frequency actually observed in Cell ( a ) is 28, which is 7.8 more than expected, and we have 11 students in Cell ( b ), which is 7.8 fewer.
In general, the expected frequencies are calculated as follows, where the expected values for cells a, b, c, d are written a ', b ', c ', d ', and E = a + b, F = c + d, G = a + c, H = b + d are the subtotals in the margins:
a ' = E × G / N, b ' = E × H / N, c ' = F × G / N, d ' = F × H / N
( Or, once one expected value is known, you can get the others by subtracting it from the subtotals in the margins. )
Now, the Chi Square Test is a test to know whether or not the two things in question really have "no relationship at all," by measuring how far the observed data deviate from the expected scores. So, you subtract the expected score from the observed one in every cell. But these differences are both positive and negative, and their sum total is zero. To avoid that, it is necessary to square each raw difference to make it positive. Also, you divide each squared difference by the expected value, to scale it by the size of the frequency expected in that cell. The Chi Square is the sum total of the differences thus treated. Namely,
χ^{ 2} = Σ [ ( X − X ' )^{ 2} / X ' ]
( where X = observed frequency of each cell; X ' = its expected frequency; "Σ" means to sum up over all cells )
For the above 79 students, the Chi Square value is:
χ^{ 2} = ( 28 − 20.2 )^{ 2} / 20.2 + ( 11 − 18.8 )^{ 2} / 18.8 + ( 13 − 20.8 )^{ 2} / 20.8 + ( 27 − 19.2 )^{ 2} / 19.2
= 7.8^{ 2} / 20.2 + 7.8^{ 2} / 18.8 + 7.8^{ 2} / 20.8 + 7.8^{ 2} / 19.2
= 12.2139 ( computed with the unrounded expected frequencies; the rounded figures shown above give about 12.34 )
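The same computation can be sketched in Python. The function name `chi_square_2x2` is hypothetical; the expected frequencies are kept unrounded, which is what reproduces the value 12.2139.

```python
def chi_square_2x2(a, b, c, d):
    """Chi Square of a 2 x 2 table from observed and expected frequencies."""
    n = a + b + c + d
    row_totals = (a + b, c + d)   # E, F
    col_totals = (a + c, b + d)   # G, H
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected frequency under the null hypothesis of no relationship.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


print(round(chi_square_2x2(28, 11, 13, 27), 4))  # 12.2139
```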
Then, look at a detailed Chi Square Table. You will find that the largest Chi Square score listed there is around 10.827, and that the "p" ( chance risk ) for that score is "p = .001" at one degree of freedom ( a single degree of freedom because, in a 2 × 2 table such as this with fixed marginal subtotals, when any one of the cells is given, the rest are automatically fixed ). Our score of 12.2139 is even larger than that, so we may conclude that the chance risk for this correlation is "p < .001." Namely, the correlation of stature and weight in the 79 Yokohama students is concluded as:
Incidentally, the Chi Square approximation becomes unreliable when the expected frequencies are too small. So, it is a requirement of the Chi Square Test that the expected frequency of every cell be greater than 5.
The Chi Square is, in essence, a number that indicates how far the observed data depart from the null hypothesis r = 0. But this score increases not only with this discrepancy but also with the size of the total number ( N ). When N is large, the probability that a given correlation occurs by chance decreases, by the Law of Large Numbers. And, also, when the association of the two phenomena is strong, the chance risk decreases. On those two truisms, the level of "p," the chance risk, is measured by this score.
Since the Chi Square is a product of both the association and the total number N, the quotient of the score divided by N gives the net amount of the association, namely the square of its intensity "r." Its square root is the correlation coefficient PHI ( φ ):
φ = √( χ^{ 2} / N )
This φ amounts to unity when there are all hits with no misses and zero when hits and misses are equal in number, fulfilling the nature of " r " in general. Let us apply the data of the 79 Yokohama students to this formula of PHI.
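A minimal Python sketch of this step follows. It assumes φ = √( χ² / N ) and uses the shortcut χ² = N ( ad − bc )² / ( E F G H ) for a 2 × 2 table, an identity that is not stated in the text; the function name `phi_coefficient` is hypothetical.

```python
import math


def phi_coefficient(a, b, c, d):
    """PHI for a 2 x 2 table, as the square root of Chi Square over N."""
    n = a + b + c + d
    e, f, g, h = a + b, c + d, a + c, b + d   # marginal subtotals
    chi2 = n * (a * d - b * c) ** 2 / (e * f * g * h)
    # The square root discards the sign, so this returns the magnitude only.
    return math.sqrt(chi2 / n)


print(round(phi_coefficient(28, 11, 13, 27), 3))  # 0.393
```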
Hence, the association of stature and weight of the 79 Yokohama students is judged as follows:
Thus far, we have been looking at the 2 × 2 contingency coefficients of Yule's Q and PHI, and at their chance risk "p" estimated by the Chi Square. Fisher's Exact Test, developed by Sir Ronald A. Fisher ( 1890-1962 ), the great English evolutionary biologist and statistician, is another method of estimating this risk (*2). In this method the chance probability "p" is expressed as an equality "=" instead of an inequality "<" as in other methods. This is why it is called an "exact test." Namely,
p = ( E ! F ! G ! H ! ) / ( N ! a ! b ! c ! d ! )
where a, b, c and d stand for the frequencies of the four cells; E = a + b, F = c + d, G = a + c, H = b + d, and N = E + F. The symbol "!" is read as "factorial," referring to a product such as 5 ! = 5 × 4 × 3 × 2 × 1, or 10 ! = 10 × 9 × 8 × ... × 3 × 2 × 1. The factorial of zero is defined as unity ( 0 ! = 1 ). Putting the Yokohama data here, we have:
This "p" of .000394453 is so small that it is practically the same thing as zero. So we can write down the stature-weight association of the 79 Yokohama students as:
( Incidentally, the factorial calculation often produces numbers so gigantic that most home computers become unable to handle them. A device to avoid this inconvenience is to use logarithms, with which a multiplication is calculated by addition and a division by subtraction. )
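Both routes can be sketched in Python. The sketch assumes the factorial formula p = E ! F ! G ! H ! / ( N ! a ! b ! c ! d ! ), which gives the probability of the observed table itself; the function names are hypothetical. The first function uses exact integers, the second the logarithm device described above.

```python
import math


def table_probability(a, b, c, d):
    """Probability of one 2 x 2 table with fixed margins, using exact integers."""
    e, f, g, h = a + b, c + d, a + c, b + d
    n = e + f
    # E! F! G! H! / (N! a! b! c! d!) equals C(G, a) * C(H, b) / C(N, E).
    return math.comb(g, a) * math.comb(h, b) / math.comb(n, e)


def table_probability_log(a, b, c, d):
    """Same probability via logarithms: products become sums, quotients differences."""
    e, f, g, h = a + b, c + d, a + c, b + d
    n = e + f

    def log_factorial(k):
        return math.lgamma(k + 1)   # lgamma(k + 1) is log(k!)

    log_p = (log_factorial(e) + log_factorial(f)
             + log_factorial(g) + log_factorial(h)
             - log_factorial(n) - log_factorial(a) - log_factorial(b)
             - log_factorial(c) - log_factorial(d))
    return math.exp(log_p)


p = table_probability(28, 11, 13, 27)
print(p < 0.001)  # True
```

Python integers have no size limit, so the exact version works here, but the logarithmic version is the safer choice in languages with fixed-size numbers.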
To say that there is a correlation in a 2 × 2 table is to say that the frequencies in the hit cells overwhelm those in the miss cells. But this is not so obvious in a table with more than two divisions. How does one distinguish hits from misses in a cross-tabulation larger than 2 × 2?
Leo A. Goodman ( 1928- ) and William H. Kruskal ( 1919-2005 ), both of the University of Chicago, devised a method to deal with the correlations in cross-tabulations larger than 2 × 2 and jointly published their coefficient under the name of "Gamma," γ, in 1954 (*3).
In this method, they counted the hits and misses not by the cells but by the pairs of individuals that make up the total observation. If, for instance, Student A is superior, or inferior, to Student K in both stature and weight, this match agrees with the contention that stature and weight are interrelated, so it is a hit. Call these hit matches "P." But if A is superior in stature but inferior in weight to K, or the other way around, it goes against the contention, so it is a miss, "Q."
Now, how do we count the numbers of P and Q? Let us take a look at the following 3 × 3 table.
In the above table, the cases in which one is superior to the partner in both stature and weight ( hits ) are the matches of an individual with those in the cells lower than and to the right of the cell it belongs to. Those are, at the same time, matches of inferiority from the partner's viewpoint. Thus, the total number of P, the hits, is:
P = Cell a × ( sum of the cells lower than and right of Cell a )
+ Cell b × ( sum of the cells lower than and right of Cell b )
+ Cell d × ( sum of the cells lower than and right of Cell d )
+ Cell e × Cell i
( There are no cells in the lower-right direction for Cells c, f, g, h and i. )
So,
P = a ( e + f + h + i ) + b ( f + i ) + d ( h + i ) + e × i
As to Q, the misses, begin with Cell c and count toward the lower left:
Q = c ( d + e + g + h ) + b ( d + g ) + f ( g + h ) + e × g
And Goodman and Kruskal defined γ ( Gamma ) as:
γ = ( P − Q ) / ( P + Q )
Now, let us divide the stature and weight of our 79 Yokohama students into three classes each, as shown below, and seek the gamma coefficient.
Our gamma is as follows:
P = 12 ( 10 + 8 + 7 + 15 ) + 11 ( 8 + 15 ) + 10 ( 7 + 15 ) + 10 × 15 = 1103
Q = 4 ( 10 + 10 + 2 + 7 ) + 11 ( 10 + 2 ) + 8 ( 2 + 7 ) + 10 × 2 = 340
γ = ( 1103 − 340 ) / ( 1103 + 340 ) = 763 / 1443 = .529
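The counting rule extends to a table of any size and can be sketched in Python. The 3 × 3 layout below is an assumption: it is reconstructed from the arithmetic above, with cells a, b, c in the first row, d, e, f in the second and g, h, i in the third. The function names are hypothetical.

```python
def count_hits_and_misses(table):
    """Count the hit pairs P and the miss pairs Q in an r x c cross-tabulation."""
    rows, cols = len(table), len(table[0])
    p = q = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):        # only cells in lower rows
                for m in range(cols):
                    if m > j:                   # lower right: hit
                        p += table[i][j] * table[k][m]
                    elif m < j:                 # lower left: miss
                        q += table[i][j] * table[k][m]
    return p, q


def gamma(table):
    p, q = count_hits_and_misses(table)
    return (p - q) / (p + q)


# Layout reconstructed from the P and Q sums in the text (an assumption).
yokohama_3x3 = [[12, 11, 4],
                [10, 10, 8],
                [2, 7, 15]]
print(count_hits_and_misses(yokohama_3x3))  # (1103, 340)
print(round(gamma(yokohama_3x3), 3))        # 0.529
```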
As to the chance risk for this correlation, Goodman and Kruskal do not give the method. The Chi Square is limited to 2 x α cross-tabulations. Kendall's Variance S method is available. But before going there, let us reconfirm the nature of the Normal Distribution here.
The Japanese speak of "acorns comparing their statures," a parable for uselessly dwelling on small differences. But acorns do differ in size from one to another. Collect many acorns and draw a graph with the size on the X axis and the number of acorns on the Y axis. Then, you will see that many cases gather around the average at the center, and the further you move from the mean in either the leftward or rightward direction, the fewer cases you find, drawing a shape like Mt. Fuji of Japan. This distribution is called the Normal Curve, or Normal Distribution.
f ( X ) = [ 1 / ( σ √( 2π ) ) ] e^{ −( X − μ )^{ 2} / ( 2σ^{ 2} ) }
[ where σ = standard deviation, μ = mean, e = base of the natural logarithm, or 2.718… ]
Not everything follows the Normal Distribution, of course. But the sizes of acorns, of people, and of many other things fit this distribution pattern. Now, all Normal Distributions are symmetric between the left and right halves; when one is cut in half at the summit ( mean ), the area of the left or right portion is exactly one half, or .5 ( 50 percent ), of the total area of unity representing the total number N.
Some distributions are widely scattered and gentle while others are steep and peaked at the summit. Now, let us standardize the degree of scattering in an index called "Variance" or "Standard Deviation" ( the square root of the Variance ), which gives larger scores for gentle curves and smaller ones for sharp curves, as follows:
Now, take a distance of one Standard Deviation, or twice as much, or three times as much, and move that distance on the X axis either leftward or rightward from the mean to draw a vertical line there. Then, it is known, the proportion of the area left inside or outside the line is the same for every distribution, whatever its shape, as long as it fits the Normal Curve.
The distance moved, counted in units of the standard deviation ( 1, 2, 3, ... ), is called the "Standard Score" or "z," which is calculated as:
z = ( X − μ ) / σ
This area, the portion cut off at the standard score, has been calculated mathematically, and a table relating z to the area is included in many introductory textbooks under the name of "z Score Table" or "Normal Distribution Table" and is attached to the end of this chapter, too. This table is useful in many parts of statistical analysis.
Incidentally, if you take a distance of 1.96 times the standard deviation in both the leftward and rightward directions from the mean ( z = ±1.96 ), the area enclosed by the two perpendiculars at those points is known to amount to 95 % of the total area. Now, back to Kendall's Variance S.
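In place of a printed z score table, the areas can be computed directly. The following Python sketch assumes the standard score z = ( X − mean ) / SD and uses the error function from the standard library; the function names are hypothetical.

```python
import math


def z_score(x, mean, sd):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mean) / sd


def area_between(z):
    """Area under the normal curve between -z and +z."""
    return math.erf(abs(z) / math.sqrt(2))


def upper_tail(z):
    """Area lying to the right of z, the one-sided chance risk."""
    return 0.5 * math.erfc(z / math.sqrt(2))


print(round(area_between(1.96), 3))  # 0.95
```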
Sir Maurice Kendall, another British statistician, devised a correlation method applicable to cross-tabulations of any number of rows and columns, and called it "TAU." Let us look at Tau b ( τb ) first (*4).
Kendall's Tau is the same as Goodman=Kruskal's Gamma in that it counts P, the hits, and Q, the misses, among the matches of all individuals in the total of N cases under observation. But Tau differs from Gamma in also taking neutral matches into account.
Those matches extending either horizontally ( the same rank weight-wise ) or vertically ( the same rank stature-wise ) are neither hits nor misses. But those should also be included in the total number of matches in the denominator. Horizontal matches are called "Xt" and vertical ones "Yt." Using the above 3 × 3 table, those numbers are counted as follows:
Then, Kendall divides P − Q by the geometric mean of ( P + Q + Xt ) and ( P + Q + Yt ) and calls the quotient "Tau b." Namely,
τb = ( P − Q ) / √[ ( P + Q + Xt ) ( P + Q + Yt ) ]
Kendall's Tau b yields somewhat more conservative scores than Gamma because Xt and Yt are included in the denominator. Now, let us get this coefficient for the above 3 × 3 cross-tabulation of our Yokohama students.
It is a conservative score, compared with the Gamma of .529.
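The counting of Xt and Yt and the resulting Tau b can be sketched in Python. The sketch assumes the definition just described, with P − Q divided by the geometric mean of ( P + Q + Xt ) and ( P + Q + Yt ), and the 3 × 3 layout reconstructed from the P and Q sums of the Gamma example. The function name `tau_b` is hypothetical.

```python
import math


def tau_b(table):
    """Kendall's Tau b for an r x c cross-tabulation."""
    rows, cols = len(table), len(table[0])
    p = q = xt = yt = 0
    for i in range(rows):
        for j in range(cols):
            n_ij = table[i][j]
            for m in range(j + 1, cols):
                xt += n_ij * table[i][m]          # same row: horizontal match
            for k in range(i + 1, rows):
                yt += n_ij * table[k][j]          # same column: vertical match
                for m in range(cols):
                    if m > j:
                        p += n_ij * table[k][m]   # lower right: hit
                    elif m < j:
                        q += n_ij * table[k][m]   # lower left: miss
    return (p - q) / math.sqrt((p + q + xt) * (p + q + yt))


# Layout reconstructed from the sums in the text (an assumption).
yokohama_3x3 = [[12, 11, 4],
                [10, 10, 8],
                [2, 7, 15]]
print(round(tau_b(yokohama_3x3), 3))  # 0.368
```

Applied to the 2 × 2 table of the same students, the function gives about .393, the same value as PHI.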
Next is the chance risk for this association. That is, we want to know how frequently the balance of the hits minus the misses occurs by chance alone. Let S stand for P − Q ( S = P − Q ). It is known that, by sheer chance, the balance S occurs most frequently at the point of zero, and that the further it moves from zero to the left or right, the less frequently it occurs, forming a normal distribution.
Hence, if the standard deviation of that normal distribution is obtained, divide the observed balance S by it to gain the standard score "z." By looking at a "z score table," the proportion of the area lying beyond the z point is known. That is the probability for the given "r" to occur by chance alone. The Variance ( the square of the standard deviation ) of the distribution of a given S is given, according to Kendall, as follows:
( where Σ means to sum up; N = total; "r" = subtotal of a row; "c" = subtotal of a column. Using the 3 × 3 table just below the heading "Goodman=Kruskal's Gamma" above, r_{ 1} = J, r_{ 2} = K, r_{ 3} = L; c_{ 1} = M, c_{ 2} = P, c_{ 3} = Q. Σr = r_{ 1} + r_{ 2} + r_{ 3}. Σc = c_{ 1} + c_{ 2} + c_{ 3}. Standard Deviation SD = √Var S )
The above equation yields a Variance of 43721.8 and a Standard Deviation of 209.097… when applied to the data of our 79 Yokohama students in the 3 × 3 cross-tabulation. The z score is 3.649, the quotient of 763 ( S ) divided by this Standard Deviation. Referring this to a z score table shows that the area lying to the right of the perpendicular at this point on the X axis is less than .1 percent of the total. Hence, the probability of getting a correlation of this intensity or more by chance alone is less than .001. So, we may conclude as follows:
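The figures quoted above can be reproduced with the following Python sketch. It assumes the tie-corrected variance formula commonly attributed to Kendall, which may differ in layout from the equation intended here, and it takes the row subtotals 27, 28, 24 and column subtotals 24, 28, 27 from the 3 × 3 layout reconstructed in the Gamma example. All function names are hypothetical.

```python
import math


def variance_of_s(row_totals, col_totals):
    """Variance of S = P - Q under the null hypothesis, corrected for ties."""
    n = sum(row_totals)

    def sum2(totals):
        return sum(t * (t - 1) for t in totals)

    def sum3(totals):
        return sum(t * (t - 1) * (t - 2) for t in totals)

    def sum5(totals):
        return sum(t * (t - 1) * (2 * t + 5) for t in totals)

    term1 = (n * (n - 1) * (2 * n + 5)
             - sum5(row_totals) - sum5(col_totals)) / 18
    term2 = sum3(row_totals) * sum3(col_totals) / (9 * n * (n - 1) * (n - 2))
    term3 = sum2(row_totals) * sum2(col_totals) / (2 * n * (n - 1))
    return term1 + term2 + term3


variance = variance_of_s([27, 28, 24], [24, 28, 27])
z = 763 / math.sqrt(variance)              # S = P - Q = 763
p = 0.5 * math.erfc(z / math.sqrt(2))      # area to the right of z
print(round(variance, 1), round(z, 3), p < 0.001)  # 43721.8 3.649 True
```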
Incidentally, the score of Tau b is the same as PHI when it is applied to a 2 × 2 table.
τb = φ, where the table is 2 × 2.
There are, thus, many methods of analysis for nonparametric association. "Which method would you recommend as the best?" To this question of mine, the late Dr. Raoul Naroll, the present author's old professor, answered that Kendall's Tau b seemed to be well accepted among many social scientists, a fond memory of an old day. Now, before ending this section, let us take a look at another of Kendall's works, the Rank Correlation.
Maurice Kendall's "Rank Correlation" is an analysis of two rankings converted from data originally measured on a ratio scale (*4). There are several merits in it.
Take the 79 Yokohama students, for instance. Rank them as 1st, 2nd, 3rd … for their stature and for their weight, and plot each student with the rank in stature on the X axis and the rank in weight on the Y axis of an X-Y coordinate chart. Then, if stature and weight are perfectly interrelated, the No. 1 student in stature must also be ranked No. 1 in weight, No. 2 as No. 2, and so on, lining up on the straight line Y = X, which ascends from the bottom left to the top right of the chart.
If, on the contrary, there is no interrelationship at all between stature and weight, the distribution must be random. And if there is an inverse relation between them, the dots should line up along the line that descends from the top left to the bottom right. So it is expected.
With those expectations, let us present here two coordinate charts for comparison, one for the stature and weight of the Yokohama students and the other for the male and female suicide rates of the world's 84 nations and areas reported by the WHO of the United Nations.
As seen in the above charts, the rank correlation of suicide rate between the sexes shows a marked inclination for both sexes to vary together, resulting in a strong coefficient of r = .757 ( τ ).
The relationship of stature and weight of the 79 Yokohama students is, on the contrary, not so evident, with not a few tall students rather skinny and the other way around, a contradiction due, perhaps, to diet and many other possible reasons. But still, a slight inclination from the bottom left to the top right is noticeable, resulting in a coefficient of r = .361 ( τ ).
Thus, Kendall's Rank Correlation has the merit that one may see the degree of relatedness directly by eye, although some may fear that certain important contents of the ratio-scale data are lost when converted into ranks. There are merits and demerits. When used wisely, the Rank Correlation of Kendall has great merit.
Next is the equation to estimate the intensity of this correlation (τ).
Kendall begins by matching all of the 79 students so that each one is paired with every other. The number of all possible pairs in a population of N members is given as:
N ( N − 1 ) / 2
( 3 pairs for N of 3; 10 pairs for N of 5; 21 for N of 7... )
In a real population, however, there are pairs that are neither hits nor misses but matches between peers ( "ties" ). The pair of A and B, where A is equal to B in one of the two variables in question, say, stature, is neither a hit nor a miss but is counted as "T." Likewise neither a hit nor a miss is the pair of K and L, where both stand equal in the other variable, say, weight; it is counted as "U."
Those T and U must be subtracted from the number of all possible pairs in the denominator, Kendall argues. In practice, he places the geometric mean of [ N ( N − 1 ) / 2 − T ] and [ N ( N − 1 ) / 2 − U ] in the denominator. Namely, Kendall's Rank Correlation coefficient Tau ( τ ) is defined as:
τ = S / √{ [ N ( N − 1 ) / 2 − T ] [ N ( N − 1 ) / 2 − U ] } ( where S = P − Q )
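The pair-by-pair counting can be sketched in Python. The function name `kendall_tau` and the four sample values are hypothetical and are not the Yokohama data; a pair tied in both variables is counted here in both T and U, which is an assumption.

```python
import math


def kendall_tau(xs, ys):
    """Kendall's rank correlation Tau, with ties T (in x) and U (in y)."""
    n = len(xs)
    p = q = t = u = 0
    for i in range(n):
        for j in range(i + 1, n):       # every possible pair exactly once
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0:
                t += 1                  # peers in the first variable
            if dy == 0:
                u += 1                  # peers in the second variable
            if dx * dy > 0:
                p += 1                  # hit
            elif dx * dy < 0:
                q += 1                  # miss
    pairs = n * (n - 1) // 2
    return (p - q) / math.sqrt((pairs - t) * (pairs - u))


# Hypothetical statures (cm) and weights (kg) of four students.
statures = [150, 160, 160, 170]
weights = [45, 60, 50, 60]
print(round(kendall_tau(statures, weights), 3))  # 0.8
```

Because the comparison depends only on order, the raw measurements can be used directly without first converting them into ranks.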
Now, let us find this Tau coefficient for our 79 Yokohama students. But that would take several days of manual calculation. So, let us first try with the first 15 students.
Next is the chance risk for this Tau. Kendall proposes the same procedure as for Tau b: get the Variance ( the square of the Standard Deviation ) of the distribution of S and find the amount of "p" by referring to a z score table. Namely,
Hence, SD = 20.207
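The value SD = 20.207 is reproduced by the variance formula for rankings without ties, Var S = N ( N − 1 ) ( 2N + 5 ) / 18, with N = 15. That this is the formula intended here is an assumption. A minimal Python sketch, with a hypothetical function name:

```python
import math


def sd_of_s_without_ties(n):
    """Standard deviation of S for n ranked cases, assuming no ties."""
    variance = n * (n - 1) * (2 * n + 5) / 18
    return math.sqrt(variance)


print(round(sd_of_s_without_ties(15), 3))  # 20.207
```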
Namely, the Rank Correlation of stature and weight of the first 15 Yokohama students is:
The Rank Correlation for all 79 Yokohama students is calculated by computer as follows:
Thus far, we have been looking at the correlations of non-ratio-scale data: Yule's Q, PHI, Goodman=Kruskal's Gamma, Kendall's Tau b and his Rank Correlation Tau. Now it is time to go on to the ratio-scale associations, Galton's Regression and Pearson's Product-Moment Correlation.
Sir Francis Galton ( 1822-1911 ) was a half-cousin of Charles Darwin. He was a man of many talents with wide interests in genetics, statistics, anthropology and other areas. He published an article, "Regression towards Mediocrity in Hereditary Stature," in the Journal of the Anthropological Institute of Great Britain and Ireland in 1886 (*5), twenty-seven years after The Origin of Species by Charles Darwin. That was twenty years after Gregor Mendel ( 1822-1884 ) published his monumental experiment on peas (*6), and fourteen years before Mendel's work was rediscovered in 1900 and became widely known.
Everyone knows that children take after their parents. But the proportions of the roles of father and mother in heredity are not so obvious. Some even say that it is the father who really passes the "seed" to the offspring, with the mother providing just the "field" for the seed to grow in. Here, Galton thought...
If the father's contribution is 100 percent and the mother's zero, the sons of a tall father are straightforwardly tall and those of a short father short. The statures of those sons and fathers, when plotted on a coordinate chart, show a distribution rising toward the top right at an angle of 45 degrees.
But if the father's and mother's contributions to heredity are equal, what can be expected? It is true that a tall man tends to marry a tall woman and a short man a short woman ( assortative mating ). But very tall men and women are rare, and so are very short men and women. So, very tall, or very short, men can hardly find a woman as extreme in stature as themselves, and they tend to marry a woman whose stature is closer to the average.
Then, if the mother's contribution is equal to the father's, the slope of the distribution line on the coordinate chart should be somewhat "regressed" toward the horizontal, with the sons taking after their mothers' trait closer to the average, a line expressed by the equation:
Y = a + b X
The "regression" indicated by the coefficient "b" in this equation must tell, Galton thought, the mothers' contribution to this heredity.
However, when the data of real fathers and sons are taken, the dots do not line up closely on a "Y = a + b X" line but scatter widely over the chart, as shown below. How, then, to draw the line running along the center of the distribution, that is, how to find the coefficients "a" and "b" in that equation? That was the first task set before Galton.
How do we draw a line running along the center of the distribution? First, try drawing several lines that approximate this requirement. The dots located on the line are fine. But those scattered above or below the line are to be considered errors from the score predicted by the line. Draw a vertical line through a dot. Then, the distance from the dot to the point where the vertical crosses the regression line is the son's error from the stature expected from his father's. Get the total of all those errors of the sons.
However, the errors are both positive and negative, canceling each other out. So, square each error to make it positive, and sum the squares.
When the sum of the squares of the sons' errors is at a minimum, the regression line predicts the sons' statures with minimum error; such a line is called the "Least Square Line." Galton found the equations to get the coefficients "a" and "b" for that Least Square Line as follows:
b = Σ ( X − Mx ) ( Y − My ) / Σ ( X − Mx )^{ 2}
a = My − b Mx ( Mx = the mean of X; My = the mean of Y )
Or, for the convenience of calculation, it is transformed to:
So much for the regression viewed from the fathers' statures toward those of the sons, the "regression of the fathers against the sons." Its coefficients "a" and "b" are termed in particular "a XY" and "b XY." The same thing can be done with the sons viewing the fathers, the "regression of the sons against the fathers," to get "a YX" ( or simply a' ) and "b YX" ( b' ). Those coefficients can be obtained by interchanging "X" and "Y" and replacing "a" by "a'" and "b" by "b'" in the above formulae. Namely,
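Both regressions can be sketched in Python. The sketch assumes the usual least-squares formulas, with b as the covariation divided by the sum of squared deviations of the predictor and a as the mean of the predicted variable minus b times the mean of the predictor. The function name and the four pairs of statures are hypothetical.

```python
def least_squares(xs, ys):
    """Coefficients (a, b) of the least square line y = a + b * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = covariation / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b


# Hypothetical statures in centimeters, for illustration only.
fathers = [165.0, 170.0, 175.0, 180.0]
sons = [168.0, 171.0, 173.0, 178.0]

a, b = least_squares(fathers, sons)              # Y = a + b X
a_prime, b_prime = least_squares(sons, fathers)  # X = a' + b' Y
print(round(b, 2))  # 0.64
```

Interchanging the two arguments is all that is needed to obtain a' and b'.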
Regression thus originally meant the reduction of the father's hereditary contribution to the son. But this meaning was soon lost, and the analysis for obtaining the Least Square Line itself came to be called "Regression Analysis." And embedded in this regression was something that measures quantitatively the association between the two variables X and Y.
The Least Square Line, or Regression, "Y = a + b X" enables one to predict an unknown Y from a known X, or X from Y. We also know that the distribution of human stature follows the Normal Distribution.
It is, then, possible to predict the unknown stature of a particular son on the basis either of the Regression and his father's stature or of the mean of the normal distribution of the sons' statures. Both predictions involve probable errors. Let us call the errors of the prediction by the mean the "Original Error," and those of the prediction by the Regression the "Errors Unexplained" by the Regression.
To measure the total amount of those errors, simple summing results in zero, because the positive and negative ones cancel each other out. To avoid that, it is necessary to square each error to make it positive and then sum the squares. Namely,
Original error = Σ ( Y−My ) ^{ 2 } [ My = the Mean of Y ]
Errors unexplained = Σ (Y−RL) ^{ 2} [ RL = prediction by regression ]
Comparing those two kinds of errors, if the Errors Unexplained are smaller than the Original, it can be said that the Regression has improved the accuracy of the prediction to that extent. In other words, the balance of the Original minus the Unexplained constitutes the "Portion Explained" by the Regression. But this portion also varies according to the total size of N.
So, divide this portion by the sum total of the Original Error, which grows with N in the same way, to get the net amount of the Portion Explained with the influence of N removed. Namely,
Portion Explained = [ Σ ( Y − My )^{ 2} − Σ ( Y − RL )^{ 2} ] / Σ ( Y − My )^{ 2}
When the errors from the regression are zero, all the data line up on the least square line. The Portion Explained is then unity, and Y is 100 percent explained by X. When the Regression errors are equal to the Original, the Portion Explained is zero, indicating that the two variables have nothing at all to do with each other.
The Portion Explained is, for this reason, also called the "Coefficient of Determination," representing the proportion of the contribution of X ( father's stature ) to Y ( son's ).
Moreover, the Portion Explained is the same as the product of the coefficient "b" of the equation of the fathers' regression against the sons and "b '" of that of the sons' regression against the fathers:
Portion Explained = b × b '
Moreover, as we will see in the following section, Pearson's Product-Moment correlation "r" is itself the square root of this Portion Explained, that is, of the Coefficient of Determination, so that "r" is given as:
r = √( b × b ' )
Thus, Pearson's "r" is, in fact, the geometric mean of Galton's regression coefficients "b" and "b '."
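The identity between the Portion Explained and the product b × b ' can be checked numerically. The Python sketch below assumes the least-squares slope formula and uses hypothetical statures; the function names are illustrative only.

```python
import math


def slope(xs, ys):
    """Least-squares slope of ys predicted from xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return covariation / sum((x - mx) ** 2 for x in xs)


def portion_explained(xs, ys):
    """(Original error - errors unexplained) / original error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = slope(xs, ys)
    a = my - b * mx
    original = sum((y - my) ** 2 for y in ys)
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (original - unexplained) / original


# Hypothetical statures in centimeters, for illustration only.
fathers = [165.0, 170.0, 175.0, 180.0]
sons = [168.0, 171.0, 173.0, 178.0]

explained = portion_explained(fathers, sons)
product = slope(fathers, sons) * slope(sons, fathers)   # b x b'
print(math.isclose(explained, product))  # True
```

The square root of this product gives the magnitude of "r"; its sign is that of "b."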
Since the previous chapter ( Statistics I ), we have been stating that the square of the correlation coefficient "r" represents the portion explained, the coefficient of determination. This meaning is, in fact, proved only in terms of this regression analysis of ratio-scale data developed by Francis Galton.
Nonparametric correlations such as PHI, Q, Gamma, Tau b, or Tau that we have seen so far are kinds of correlation that resemble Pearson's Product-Moment correlation "r," and on that basis the square of those coefficients is inferred to carry a meaning similar to the "portion explained" in regression analysis. The Regression and the "Portion Explained": back in 1886, Francis Galton was finding those truisms in this universe.
Now, let us run this regression analysis with the data of the stature and weight of our 79 Yokohama students.
Hence, the "regression of stature on weight" ( stature as X, weight as Y ) is: Y = −32.6 + 0.518 X
And the "regression of weight on stature" is:
So, it is X = 133.86 + 0.494 Y.
Next is its "portion explained," that is, the "coefficient of determination."
Or, in other words, it is: b × b ' = .518 × .494 = .255
Hence, the conclusion is that "human stature exercises about 25.5 percent of the influence on weight."
There was a man in London named Karl Pearson ( 1857-1936 ), 35 years younger than Francis Galton. He, too, was a man of wide inquiry like Galton. When he was young, he studied in Germany and conceived a strong interest in the literature, history and philosophy there. He also seems to have kept an interest in socialism through Karl Marx. He declined a knighthood all through his life, a fact that some suspect to be a result of this inclination.
Around 1900, Pearson met Francis Galton through an introduction by Walter Weldon, a zoologist colleague. From then on, he was more than glad to be Galton's "statistical heir" and protégé, and he contributed to the advancement of Galtonian statistics until his last day. Pearson opened the world's first Department of Statistics at University College London in 1911.
Pearson took up the idea of Galton's regression with the conviction that there was something in it to show quantitatively the degree of association between X and Y.
First of all, both Variable X ( father's stature ) and Y ( son's ) take scores scattered around the mean at the center, each forming a Normal Distribution. Remember, here, the equation to get the coefficient "b" of Galton's regression Y = a + b X.
The numerator for the coefficient "b," "Σ ( X − Mx ) ( Y − My )," is the sum of the products of the deviations of X and Y from their means, a value called "covariation." Let us study this value a little further here.
There are four cases, A, B, C and D, as shown below. "A" is the case where the father ( on X ) is below the average but the son ( on Y ) is above it. ( Note that the upper half on Y, that is, the left-hand half when viewed standing on the Y axis, is above the average, an arrangement opposite to that of regular normal curve charts. ) In "B," both father and son are above the average. In "C," both father and son are below it. And in "D," the father is above but the son is below.
Now, we believe that a son takes after his father. Then, cases A [ father below but son above ] and D [ father above but son below ] are the "misses," whereas cases B [ both above ] and C [ both below ] are the "hits" ( for our belief ).
At the same time, this covariation is based on the deviations from the mean. When the stature is above the average, the deviation is positive, whereas it is negative when below. So, in cases A and D the product is negative, while cases B and C yield positive products. In other words, positive products of father and son are the "hits" while negative ones are the "misses."
Hence, the more hit cases there are ( right-hand picture in the chart ), the larger in positive numbers is the sum total of the products over all four sections A, B, C and D. When hits and misses balance ( left-hand picture ), it is zero. And the more misses there are, the larger in negative numbers is the sum total. This is nothing other than correlation, is it not? There must be something here with which to measure the amount of correlation.
However, the amount of covariation varies with the size of N and can range without limit in both the positive and negative directions, so that it is not straightforwardly evident how large or small it is. Pearson solved this by dividing it by the geometric mean of the sums of the squared deviations from the means. The quotient varies between plus and minus unity. He called the value "r," the "correlation coefficient" (*7). Thus, "r" is defined as:
r = Σ ( X − Mx ) ( Y − My ) / √[ Σ ( X − Mx )^{ 2} Σ ( Y − My )^{ 2} ]
And, as a calculating equation:
This is the definition of Pearson's Product-Moment Correlation Coefficient "r." And, as stated earlier, this "r" is, to one's wonder, equal to the square root of Galton's Portion Explained, that is, the square root of the product of the coefficients "b" and "b '" of the least square lines.
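As a rough check that these routes to "r" agree, here is a minimal sketch. The stature and weight pairs are invented for illustration (they are not the Yokohama data), and the three function names are hypothetical.

```python
from math import copysign, sqrt

# Hypothetical stature (cm) and weight (kg) pairs, invented for illustration.
stature = [150, 160, 165, 170, 180]
weight = [45, 52, 60, 58, 70]

def _sums(xs, ys):
    """Sums of products and squares of the balances from the means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy, sxx, syy

def r_by_definition(xs, ys):
    """Covariation divided by the geometric mean of the squared-balance sums."""
    sxy, sxx, syy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)

def r_by_calculation(xs, ys):
    """The raw-score form that avoids computing the means first."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def r_from_slopes(xs, ys):
    """Square root of b times b', signed like the covariation."""
    sxy, sxx, syy = _sums(xs, ys)
    b = sxy / sxx          # slope of Y on X
    b_prime = sxy / syy    # slope of X on Y
    return copysign(sqrt(b * b_prime), sxy)

r = r_by_definition(stature, weight)
print(round(r, 3), round(r_by_calculation(stature, weight), 3),
      round(r_from_slopes(stature, weight), 3))
```

The sign has to be carried over separately in the last function, because the product of "b" and "b '" is never negative.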
Now, let us get this " r, " the correlation coefficient of stature and weight of our Yokohama students.
Or, getting it from the geometric mean of " b " and " b ', " ( we already have those scores )
Thus, the score of "r" is the same (r = .506) whether it is obtained through the covariation or through the coefficient of determination.
We have obtained, thus, the correlation of "r = .506" for the 79 Yokohama students. But what about the students of another school? Or people from other groups in other countries? Next comes the subject of the chance risk of this correlation.
Karl Pearson considers the problem of the chance risk as one involved in drawing a sample from the universe, the problem called the "sampling error."
As to the correlation of stature and weight among the 79 Yokohama students, for instance, the seventy nine people are to be considered as a sample drawn out of the whole human population of some six billion men and women young and old. This sampling must have involved probable errors, as in any sampling. That is the nucleus of the problem of chance risk of the correlation, he maintains.
In this view of the universe and its sample, a score of the universe is called a "parameter" and that of a sample a "statistic." And there is a tradition of writing a "parameter" in Greek letters and a "statistic" in Roman letters.
(In the meantime, it is because the parameters are not directly taken care of in correlations such as φ, Q, γ, τb and τ, which we have seen, that those are called "nonparametric.")
For example, if one writes "M" for a mean, it refers to the mean of the sample, and "μ" ("mu," Greek for "m") to the mean of the universe. In the same way, Pearson's Product-Moment correlation r refers to that of the sample. That of the universe is expressed in the Greek letter "ρ" ("rho," equivalent to "r"). Then, how to infer the unknown parameter "rho" safely and soundly on the basis of the sample correlation "r = .506" of the Yokohama students is the task imposed on us next.
A sample resembles the universe. Visiting, say, an orange orchard, you try a couple of oranges there and they are sweet. Then you feel like offering the orchard owner some words of praise, that the oranges of his mountain are so sweet...
Strictly speaking, however, you do not know yet how sweet all the oranges of that orchard are; you know about just a couple of them. But your common sense tells you that such logic is nothing more than a quibble. That is to say, you consider all the oranges of the orchard as the universe and the couple you ate as the sample taken out of the universe. And, on the basis of the truism that a sample resembles its universe, you make an inference on the nature of the whole mountain. But the problem of the sampling error remains...
By the same token, we consider the 79 Yokohama students as a sample taken out of the universe of the whole human race. The sample correlation was "r = .506." The truism that a sample resembles its universe leads us to infer that the correlation in the whole human race must be somewhere near .506. But inference does not guarantee this; it just suggests that it is "near" there.
Errors are known to form a Normal Distribution in many cases. The sample statistic "r," too, is expected to distribute in a Normal Curve when taken many times from the same universe with the parametric correlation "ρ." The score of "r" must occur most frequently near the score of "ρ," and the further it deviates from "ρ," the less frequently it must appear, so it is expected. But this expectation holds true, it is known, when "ρ" is sufficiently small or the sample size "N" is sufficiently large. Otherwise, the sampling distribution of "r" is distorted from the shape of the Normal Distribution.
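This distortion can be seen in a small experiment. The sketch below is only an illustration under stated assumptions: the universe is taken to be bivariate normal, and the particular values ρ = .9 with N = 10 and ρ = 0 with N = 100, as well as the function names, are choices made for this example.

```python
import random
import statistics
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation of paired scores."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def sample_rs(rho, n, draws=2000, seed=1):
    """Draw many samples of size n from a universe with correlation rho."""
    rng = random.Random(seed)
    rs = []
    for _ in range(draws):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # y shares a fraction rho of x; the rest is independent noise.
        ys = [rho * x + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
        rs.append(pearson_r(xs, ys))
    return rs

for rho, n in [(0.0, 100), (0.9, 10)]:
    rs = sample_rs(rho, n)
    print(rho, n, round(statistics.mean(rs), 3), round(statistics.median(rs), 3))
```

In a symmetric distribution the mean and the median of the sampled "r" nearly coincide; a mean that sits clearly below the median is a sign of a long tail toward the lower scores.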
Sir Ronald Fisher (1890-1962), great English biologist and statistician, tried to resolve this problem by correcting the "r" into "r '" so that it holds for any number "N" and all sizes of "ρ." (Fisher called the so-corrected "r '" "z," but that is easily confused with the standard score "z." So, we will keep using the symbol "r '" in the discussion to follow.) This Fisher's correction is as follows:
$$ z = r' = \frac{1}{2} \log_e \frac{1 + r}{1 - r} $$
where "e" = 2.718… [base of the natural logarithm] and z = Fisher's z.
A concordance table of "r" to this "z" is also attached at the end of many textbooks. This "z" ("r '") is said to form a normal distribution with the variance as follows:
$$ \sigma_{r'}^2 = \frac{1}{N - 3}, \qquad \text{S.D.} = \frac{1}{\sqrt{N - 3}} $$
At first, for the convenience of calculation, we presume that there is no correlation at all in the universe (ρ = 0 ). When samples are taken out of this universe, the sample correlation “ r ' " fluctuates in a normal distribution with the standard deviation given above.
It is known that the portion of the inside area of a normal distribution cut by the perpendiculars at the [ standard score z = ± 1.96 ] amounts to 95 percent of the total area. Then, the scores of “ r ' " at the points of the standard deviation multiplied by plus and minus 1.96 must correspond to the upper and lower limits of the estimate of “ρ” inferred on the 95 percent level of confidence.
Remember, however, that this inference has been made on the null hypothesis of "ρ = 0." But the reality is that samples resemble the universe. So, we put it back to the distribution with "ρ = r '" by adding "r '" to all those scores. And, again, since those scores are expressed in terms of "r '," they have to be reconverted into "r." The procedure is so complicated that we had better itemize it as follows:
Procedure
1. Get the sample correlation "r" from the data in hand.
2. Convert "r" into "r '" by Fisher's correction of "z."
3. Get the standard deviation (S.D.) of "r '."
4. Determine the sampling distribution span of "r '" at 95 % confidence (when ρ = 0) by multiplying S.D. by ± 1.96.
5. Add "r '" to the span to get the real ρ span.
6. Reconvert "r '" back to "r" to get the ρ span in terms of "r."
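The six steps can be sketched in a few lines. This is a minimal sketch, not the author's program: it assumes the standard forms of Fisher's correction, r ' = ½ log_e[(1 + r)/(1 − r)] (which is `atanh` in Python), and of its standard deviation, 1/√(N − 3). The function name `rho_interval` is hypothetical.

```python
from math import atanh, sqrt, tanh

def rho_interval(r, n, z_crit=1.96):
    """Span of rho inferred from a sample correlation r of n cases."""
    r_prime = atanh(r)              # step 2: convert r into r'
    sd = 1 / sqrt(n - 3)            # step 3: S.D. of r'
    span = z_crit * sd              # step 4: span of r' when rho = 0
    low = r_prime - span            # step 5: shift the span to centre on r'
    high = r_prime + span
    return tanh(low), tanh(high)    # step 6: reconvert r' back to r

# Step 1 is the sample correlation itself: r = .506 for the 79 students.
low, high = rho_interval(0.506, 79)
print(f"{low:.3f} <= rho <= {high:.3f}")
```

Computed without intermediate rounding, the limits may differ in the third decimal place from a hand calculation that rounds "r '" and the S.D. along the way.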
Now, following this procedure step by step, let us find the span of inference of the distribution of “ρ," correlation of stature and weight of all human beings, on the basis of the data of our Yokohama students.
95 % span of r ’ (ρ= 0 ) = ±1.96 ×sd = ± 1.96 × .115 = ± .2254
95 % span of r ' (ρ = r ') = the above + r ' = .558 ± .2254
Or, .3326 ≦ r ' ≦ .7834
Getting this "r '" back to "r," .320 ≦ r ≦ .654
That is the span of ρ score inferable at the 95 percent level of confidence. Namely,
The correlation of stature and weight of the whole human race is estimated to be somewhere in the range of .320 ≦ ρ ≦ .654 on the confidence level of 95 percent.
That is the conclusion.
How do you like this conclusion? I, as a user of statistics, find three difficulties here.
The first is in the idea of span estimation. When it is stated that the truth is somewhere between this and that, it is not easy to picture the scene. It would be much easier to see the picture if it were pinpointed, stating "the r is so and so and its chance risk is as little as.....," as in nonparametric correlations.
Suppose that you were a fleet commander. One source of intelligence tells you that the enemy is somewhere in the Pacific Ocean at the confidence level of 95 percent, and another source says that the enemy is at Coral Island such and such with a chance risk of less than 5 percent. Which intelligence source would you count on really?
The second difficulty is that neither "r" nor "ρ" is represented in a pictorial table, such as an r-by-c table or the coordinate charts of regression analysis. Numbers alone do not make the scene straightforwardly evident.
And the third is about the concept of sample and universe. What we really want to know is the intensity of an association and its chance risk. This parametric statistics assumes this risk all due to the sampling error. Is it sufficiently correct?
It is true that statistics, the indices of samples, are not without sampling errors. But is it not so with parameters, the states of a universe, too? Everything present or occurring in the universe is a mixture of necessity and fortuity, we maintain. Years ago, for instance, a very skinny actress called Twiggy was popular. The world's ladies admired her and tried to shape themselves to her proportions. A few years later, however, she was gone and a little plumper style became fashionable.
The ρ of the world ladies is moving along with such a shift of worldly fashion, fortuitous fluctuation apparently. Is it sufficiently correct to assume all the chance risk of r, including such in ρ, as the sampling error from the universe?
Why in the world do we have to stick so much to the concept of sample and universe? Is there any defect in examining the amount of chance risk directly on the facts we have in hand, without counting on the concept of sampling? It should be much simpler in procedure to seek the degree of chance risk directly, as in nonparametric statistics. And the simpler it is, the less room is left for errors to intervene.
At any rate, the methods of estimating the amount of chance risk we have seen so far are all built on probability theory, a deductive approach. Deduction goes along with induction. Then, it should be just as possible to try an inductive method, learning the probability by observing real happenings, as in throwing dice or the like. Such inductive methods are generally called "Monte-Carlo Simulation." What comes next is one such Monte simulation that the present author has devised.
The Monte-Carlo Simulation is an experiment to see how likely a correlation is to occur by chance alone by actually producing random incidents, rather than knowing it by the logic of probability alone. Such a simulation could be designed in various ways for particular cases. The simulation devised here is limited to 2 × 2 contingency tables and rests on the following algorithm. Let us take an example.
Say there was a total of 157 students, 76 boys and 81 girls, in a classroom. They were given an exam in mathematics, and those scoring 40 points or over were counted as passed. Then a subtotal of 74 students, or 47 percent of the 157, passed this exam: 45 boys and 29 girls, as shown below.
            Boys    Girls   Total
Passed       45      29      74
Failed       31      52      83
Total        76      81     157
First of all, let us suppose that:
< Boys are not inherently superior in math learning to girls. >
If this presumption is true, then it is expected that about the same 47 percent of the boys as well as girls have passed the exam; that is, 36 boys ( 76 × .47 = 36 ) and 38 girls ( 81 × .47 = 38 ) must have passed.
Actually, however, 45 boys, 9 more than expected, and only 29 girls, 9 less, passed the exam, as shown in the above table. In other words, a correlation is present between the math achievement and the sexes.
The task before us here is to see how often correlations of such intensity or more occur by chance alone. In that operation, we take as given the total 157 students of 76 boys and 81 girls, and of 74 successful and 83 unsuccessful students, as shown in the margin of the table.
Now, in a 2×2 contingency table such as the above, when all the numbers of the margin are set, any one of the four cells automatically determines the remaining three. (If the boys who passed are 45, then the other three cells can take no scores other than the ones given in the table.) And, that there is a positive correlation (r > 0) means that the numbers of the left-top cell and of the right-bottom cell exceed those expected at r = 0. You can take either the left-top cell or the right-bottom one at your choice, because those two increase and decrease together. Let us take the left-top cell here.
And, that there is a correlation as such or stronger, when the margin is all set, means that the score of the left-top cell is the present 45 or more. A negative correlation is the case where the right-top and left-bottom cells exceed the expected, so that the right-top cell shows the given score or more. If the scores of the four cells are the same as the expected, there is no correlation (r = 0). With those points in mind, let us distribute the 74 successful students among boys and girls by chance alone.
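How one cell fixes the other three, and what the margins alone lead one to expect, can be sketched as follows. The helper names are hypothetical, and the correlation is computed here as the phi coefficient of the 2 × 2 table, which is an assumption about how the "r" of these tables was obtained.

```python
from math import sqrt

def fill_table(row1_total, col1_total, n, top_left):
    """Rebuild all four cells from the margins and the left-top cell."""
    a = top_left
    b = row1_total - a          # right-top: passed girls
    c = col1_total - a          # left-bottom: failed boys
    d = n - a - b - c           # right-bottom: failed girls
    return [[a, b], [c, d]]

def expected_top_left(row1_total, col1_total, n):
    """Left-top count expected when there is no correlation."""
    return row1_total * col1_total / n

def phi(table):
    """Correlation of a 2 x 2 table (phi coefficient)."""
    (a, b), (c, d) = table
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

table = fill_table(row1_total=74, col1_total=76, n=157, top_left=45)
print(table)
print(round(expected_top_left(74, 76, 157), 1), round(phi(table), 3))
```

The observed 45 exceeds the expected left-top count, which is what makes the correlation positive.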
Let us first prepare two dishes, one for boys and the other for girls. And let us take 74 marbles and throw them down one by one with our eyes covered. (Each marble has no way but to drop onto one or the other of the dishes.) After all the marbles have been thrown, the number of marbles on each dish is regarded as the number of boys and of girls who passed the exam.
Now, the ratio of boys to girls here is not exactly one half. There are more girls (52 %) than boys (48 %). In accordance with this real ratio, we must size the dishes at 52 percent for girls and 48 percent for boys. We do this by having a computer generate random numbers.
A computer generates five-digit decimals from .00000 through .99999 randomly. If the number thus generated is .48000 or smaller (48 %), we consider it to have fallen on the boys' dish, and when it is larger than that (52 %), on the girls' dish. We repeat this 74 times, the total number of successful students. A round of those 74 innings makes a tournament.
When there are 45 or more successful boys in a tournament, it is regarded as a "win." Repeat the tournament thousands or tens of thousands of times. Divide the number of wins by the total number of tournaments, and you will have your winning ratio. This ratio is the probability of getting a positive correlation of the present strength or more by chance alone. Let us expand this procedure for more general cases.
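The tournament described above can be sketched as follows. This is a minimal sketch rather than the author's program: the function and parameter names are hypothetical, the boys' dish is sized at the exact share 76/157 instead of the rounded .48, and a fixed seed is used only to make the run repeatable.

```python
import random

def monte_carlo_p(target, innings, share, tournaments=10_000, seed=None):
    """Winning ratio: how often `target` or more marbles land on one dish."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(tournaments):
        # One tournament: throw `innings` marbles; each lands on the
        # dish of interest with probability `share`.
        on_dish = sum(1 for _ in range(innings) if rng.random() < share)
        if on_dish >= target:
            wins += 1
    return wins / tournaments

# 74 successful students, 45 or more of them boys, boys being 76 of 157.
p = monte_carlo_p(target=45, innings=74, share=76 / 157, seed=2024)
print(f"p = {p:.4f}")
```

Every marble is thrown independently here, as in the description, so the number of boys is not held to 76 within a tournament.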
We have a total of 74 successful students and a total of 76 boy students in the present case. It may be possible that the 74 happy students were all boys. But in another classroom, say, it may well be that successful students were 74 but that there were only 60 boys in the room. In such a class, it is impossible for all successful students to be boys.
So, in a positive correlation, a procedure is necessary in which you compare the right-top margin with the left-bottom one and set the smaller of the two as the total innings. And, when you set the left-bottom margin as the total innings, the dishes are not for boys and girls but for the successful and unsuccessful boys. Accordingly, the sizes of the dishes are to be prepared at the ratio of successful to unsuccessful students, not of boys to girls.
In the case of a negative correlation, it is the right-top cell (successful girls), instead of the left-top one (successful boys), that matters. The rest is the same as in the positive case. (The closer to - 1, the stronger is a negative correlation.) The essential part of the computer program of this procedure is attached to the Notes (*9).
For this experiment, we prepared the following four contingency tables by varying the frequencies of the cells, with the margin set the same as the above, so that the correlations range from the near-nil r = .005 to the strongest r = .362. The case No.2 is the one we have used here so far.
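The choice of innings, dish size and cell to watch can be written out as a small planning step. The sketch below mirrors the four branches of the program in the Notes; the function name, the dictionary keys and the second classroom (60 boys) are hypothetical illustrations.

```python
def simulation_plan(table):
    """Pick innings, dish share and the cell to watch for a 2 x 2 table.

    table = [[a, b], [c, d]] with rows (passed, failed) and
    columns (boys, girls).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row1 = a + b                 # right-top margin: successful students
    if a * d - b * c >= 0:       # positive correlation: watch the left-top cell
        target, column = a, a + c
    else:                        # negative correlation: watch the right-top cell
        target, column = b, b + d
    if row1 <= column:
        # Throw one marble per successful student onto the column's dish.
        return {"innings": row1, "share": column / n, "target": target}
    # Otherwise throw one marble per member of the column onto the "passed" dish.
    return {"innings": column, "share": row1 / n, "target": target}

print(simulation_plan([[45, 29], [31, 52]]))   # the classroom of the example
print(simulation_plan([[40, 34], [20, 63]]))   # hypothetical class with 60 boys
```

In the second table only 60 marbles are thrown, one per boy, and the dish is sized by the share of successful students.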
We conducted the tournament 10,000 times each to gain the winning ratio, “ p,” and repeated this series ten times on each of the four contingency tables.
For comparison, we set these results against the scores of the CHI Square Test and Fisher's Exact Test. Both our Monte and Fisher's are one-tailed tests whereas CHI Square is two-tailed. So, we use one half of the "p" originally yielded by CHI Square (*10). The result was as follows:
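The two reference scores can be reproduced for a single table with the standard library alone. This is a sketch under stated assumptions: the Chi Square is computed without a continuity correction, its "p" for one degree of freedom is taken from the normal curve through `erfc`, and Fisher's one-tailed "p" is summed over all tables with the left-top cell at the observed count or more. The function names are hypothetical.

```python
from math import comb, erfc, sqrt

def chi_square_2x2(table):
    """Chi Square of a 2 x 2 table, without continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi_square_p(chi2):
    """Two-tailed p for one degree of freedom."""
    return erfc(sqrt(chi2 / 2))

def fisher_one_tailed(table):
    """P(left-top cell >= observed) with all margins held fixed."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    ways = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return ways / comb(n, row1)

case2 = [[45, 29], [31, 52]]
chi2 = chi_square_2x2(case2)
print(round(chi2, 3), chi_square_p(chi2) / 2, fisher_one_tailed(case2))
```

Halving the two-tailed "p" of the Chi Square puts it on the same one-tailed footing as the other two methods.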
What does this experiment tell us? It tells, first of all, that chance transforms itself into necessity when repeated no less than ten thousand times ( Law of large numbers ), doesn't it?
At the correlation near zero, r = .005, the chance occurrences are concentrated in a small range from 52.5 percent (525 times per thousand) to 54.5 percent. For a real zero correlation, the probability for it or a stronger one to occur by chance is a little over 50 percent ("a little over" because the case of zero is included). The occurrences of 52 to 54 percent at the level of r = .005 can be said to be within the range of this expectation.
With the strongest correlation of r = .362, on the other hand, the chance occurrence is only .1 and .0 percent. It tells that such happenings do not occur even once in a hundred times; or, if at all, only once in a thousand times or less. And for those in between, r = .2 to r = .3, the probability ranges from 2 to .2 percent.
Comparing those results with CHI Square and Fisher, which are computations on pure logic, the score of CHI Square (p < .475) is close to Monte at the near-zero level of r = .005, while Fisher's "p = .127" is about one quarter of Monte and CHI Square.
At the medium levels of r = .234 and r = .311, the predictions of CHI Square as well as Fisher are roughly one tenth of our Monte scores. For the strongest level of r = .362, they range from five hundredths (CHI Square) to four thousandths (Fisher) of our Monte, gigantic differences apparently.
Gigantic as those differences appear, they are only a matter of pure numbers and present no problem in practice. When our common sense holds that happenings with a chance risk of more than 5 percent are due to chance, what difference does it make whether the risk is 12 percent or 49 percent? Likewise, it comes to the same judgment of real (not of chance) whether the chance risk is .2, .25, .05 or .0004 percent. Those chance risks are the same as zero, practically speaking.
Namely, all three methods compared here, CHI Square, Fisher's and Monte, agree with one another, when dealing with the total number N = 157, that Case 1 where r = .005 is to be judged as due to chance, and that all the others with the correlations of r = .234, r = .311 and r = .362 must be judged as real, not due to chance.
Lastly, it should be fair to restate that the simulation conducted here was done by a computer, a thinking machine. It is next to impossible for a machine to run a "random" performance in its full sense unless dealing with the world of quantum dynamics (*11). The randomness our computer made here was an apparent randomness. But what matters to us here is not so much whether it is truly random or not as how random is random enough for the situation we face here and now.
The Monte Carlo Simulation presented here may be judged as tenable, at least, as much as the mathematical calculations based on probability theory such as CHI Square Test and Fisher's Exact Test.
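' X(1,1), X(1,2), X(2,1), X(2,2) are the values of the 4 cells.
For I=1 to 2: For K=1 to 2: input X(I,K): next K, I
' E, F = row totals; G, H = column totals; N = grand total
E=X(1,1)+X(1,2): F=X(2,1)+X(2,2): G=X(1,1)+X(2,1): H=X(1,2)+X(2,2): N=E+F
' Only the sign of the correlation is needed below; it is the sign of ad - bc.
R=X(1,1)*X(2,2)-X(1,2)*X(2,1)
' PT = innings per tournament, SP = size of the dish being counted, CT = observed cell
if R>=0 and E<=G then PT=E: SP=G/N: CT=X(1,1)
if R>=0 and E>G then PT=G: SP=E/N: CT=X(1,1)
if R<0 and E<=H then PT=E: SP=H/N: CT=X(1,2)
if R<0 and E>H then PT=H: SP=E/N: CT=X(1,2)
input "How many rounds do you want to repeat?";RP
' Seed the random numbers with the seconds of the clock
RD$=right$(Time$,2): RD=val(RD$)
randomize RD
for I=1 to RP
  for K=1 to PT
    ' A number of SP or smaller falls on the dish being counted
    SD=rnd: if SD<=SP then LF=LF+1
  next K
  ' A tournament is a win when the count reaches the observed cell or more
  if LF>=CT then RT=RT+1
  LF=0
next I
P=RT/RP
print "p=";P
end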
 Z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
------------------------------------------------------------------------------
 .0 : .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641
 .1 : .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
 .2 : .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
 .3 : .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
 .4 : .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
 .5 : .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
 .6 : .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
 .7 : .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
 .8 : .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1921  .1894  .1867
 .9 : .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
1.0 : .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
1.1 : .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
1.2 : .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
1.3 : .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
1.4 : .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
1.5 : .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
1.6 : .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
1.7 : .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
1.8 : .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
1.9 : .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
2.0 : .0227  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
2.1 : .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
2.2 : .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
2.3 : .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0085
2.4 : .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
2.5 : .0062  .0060  .0059  .0057  .0055  .0054  .0052  .0051  .0049  .0048
2.6 : .0047  .0045  .0044  .0043  .0041  .0040  .0039  .0038  .0037  .0036
2.7 : .0035  .0034  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026
2.8 : .0026  .0025  .0024  .0023  .0023  .0022  .0021  .0021  .0020  .0019
2.9 : .0019  .0018  .0017  .0017  .0016  .0016  .0015  .0015  .0014  .0014
3.0 : .0014  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010
3.5 : .0002326
4.0 : .0000317
4.5 : .0000034
5.0 : .0000002867
------------------------------------------------------------------------------
  p       χ2            p       χ2
------------------------------------------------
 .001   10.826         .500   .454937
 .005    7.87944       .750   .1015308
 .010    6.63490       .900   .0157908
 .025    5.02389       .950   .00393214
 .050    3.84146       .975   .000982069
 .100    2.70554       .990   .000157088
 .250    1.32330       .995   .0000392704
------------------------------------------------