Correlation analysis and the significance of the correlation coefficient

The correlation coefficient measures the degree of relationship between two variables. Calculating it shows whether there is a relationship between two data sets. Unlike regression, correlation does not predict the values of quantities; nevertheless, computing the coefficient is an important step in preliminary statistical analysis. For example, suppose we find that the correlation coefficient between the level of foreign direct investment and the GDP growth rate is high. This suggests that, to ensure prosperity, it may be worth creating a favorable climate specifically for foreign entrepreneurs. Not such an obvious conclusion at first glance!

Correlation and Causality

Hardly any other area of statistics has become so firmly established in everyday life. The correlation coefficient is used throughout the social sciences. Its main danger is that high values are often exploited to persuade people of some conclusion. In fact, however, a strong correlation by no means indicates a cause-and-effect relationship between the quantities.

Correlation coefficient: the Pearson and Spearman formulas

Several basic indicators characterize the relationship between two variables. Historically the first is the Pearson linear correlation coefficient, which is taught in school. It was developed by K. Pearson and G. U. Yule on the basis of the work of Francis Galton. This coefficient measures the linear relationship between quantitative variables and always lies between -1 and 1. A negative value indicates an inverse relationship, a positive value a direct one, and a value of zero means there is no linear relationship between the variables. Spearman's rank correlation coefficient simplifies the calculation by replacing the values of the variables with their ranks.
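As an illustrative sketch (Python is used here purely for illustration; no code appears in the original), both coefficients can be computed by hand for a small sample. Spearman's coefficient is simply Pearson's formula applied to the ranks of the values:

```python
# Illustrative sketch: Pearson and Spearman coefficients in pure Python.
# The helper names (pearson, ranks, spearman) are ours, not from the text.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def ranks(values):
    # Ascending ranks; tied values share the average of their positions
    s = sorted(values)
    return [sum(i + 1 for i, v in enumerate(s) if v == val) / s.count(val)
            for val in values]

def spearman(x, y):
    # Spearman's coefficient is Pearson's applied to the ranks
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(round(pearson(x, y), 3))   # 0.8
print(round(spearman(x, y), 3))  # 0.8
```

On this sample the two coefficients coincide; on data with a monotonic but nonlinear relationship they would diverge.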

Relationships between variables

Correlation helps answer two questions: first, whether the relationship between the variables is positive or negative; second, how strong the relationship is. Correlation analysis is a powerful tool for obtaining this information. It is easy to see that family income and expenses rise and fall together; this relationship is considered positive. Conversely, when the price of a product rises, demand for it falls; this relationship is called negative. The values of the correlation coefficient range between -1 and 1. Zero means there is no relationship between the quantities under study, and the closer the coefficient is to the extreme values, the stronger the relationship (negative or positive). A coefficient between -0.1 and 0.1 indicates the absence of dependence; bear in mind, though, that such a value rules out only a linear relationship.

Features of application

Using either indicator involves certain assumptions. First, the presence of a strong connection does not establish that one quantity determines the other: there may well be a third quantity that drives both. Second, even a high Pearson correlation coefficient does not indicate a cause-and-effect relationship between the variables under study. Third, it captures an exclusively linear relationship. Correlation can be used to evaluate meaningful quantitative data (e.g., barometric pressure, air temperature) but not categories such as gender or favorite color.
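The third caveat is easy to demonstrate with a hypothetical sketch: a perfect but nonlinear dependence can yield a Pearson coefficient of exactly zero.

```python
# y = x^2 is a perfect functional relationship, yet the linear
# (Pearson) correlation over a symmetric range is exactly zero.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = (sum((a - mx) ** 2 for a in x) *
       sum((b - my) ** 2 for b in y)) ** 0.5
r = num / den
print(r)  # 0.0
```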

Multiple correlation coefficient

Pearson and Spearman examined the relationship between two variables. But what if there are three or even more? This is where the multiple correlation coefficient comes to the rescue. For example, gross national product is influenced not only by foreign direct investment but also by the government's monetary and fiscal policies and by the level of exports. The growth rate and volume of GDP are the result of the interaction of a number of factors. It must be understood, however, that the multiple correlation model rests on a number of simplifications and assumptions: first, multicollinearity among the explanatory quantities is excluded; second, the relationship between the dependent variable and the variables influencing it is assumed to be linear.

Areas of use of correlation and regression analysis

This method of finding relationships between quantities is widely used in statistics. It is most often resorted to in three main cases:

  1. To test cause-and-effect relationships between the values of two variables. The researcher hopes to discover a linear relationship and derive a formula that describes it; the units of measurement of the two quantities may differ.
  2. To check for a relationship between quantities. In this case, no variable is designated as dependent, and it may turn out that some other factor determines the values of both.
  3. To derive an equation. Known numbers can then be substituted into the equation to find the values of the unknown variable.

Humans in search of cause and effect

Our consciousness is built in such a way that we feel compelled to explain the events happening around us. A person constantly looks for connections between the picture of the world he lives in and the information he receives. The brain often creates order out of chaos and can easily see a cause-and-effect relationship where there is none. Scientists have to train themselves specifically to overcome this tendency; the ability to evaluate relationships between data objectively is essential in an academic career.

Media bias

Let's consider how the presence of a correlation can be misinterpreted. A group of British students with behavioral problems were asked whether their parents smoked, and the results were then published in a newspaper. The study showed a strong correlation between parental smoking and their children's delinquency. The professor who conducted it even suggested putting a warning to this effect on cigarette packs. There are, however, several problems with this conclusion. First, correlation does not show which of the quantities is independent; it is therefore quite possible to suppose that the parents' harmful habit is caused by the children's disobedience. Second, it cannot be said with certainty that both problems did not arise from some third factor, such as low family income. Finally, the emotional stance of the professor is worth noting: he was an ardent opponent of smoking, so it is not surprising that he interpreted his results the way he did.

Conclusions

Misinterpreting a correlation as a cause-and-effect relationship between two variables can lead to serious research errors. The problem is that this tendency lies at the very foundation of human consciousness, and many marketing tricks exploit it. Understanding the difference between causation and correlation lets you analyze information rationally both in daily life and in a professional career.

When studying public health and healthcare for scientific and practical purposes, a researcher often has to analyze statistically the relationships between factor and outcome characteristics of a statistical population (a causal relationship), or to determine whether parallel changes in several characteristics of the population depend on some third quantity (their common cause). One must be able to study the features of such a connection, determine its size and direction, and assess its reliability. Correlation methods serve this purpose.

  1. Types of manifestation of quantitative relationships between characteristics
    • functional connection
    • correlation connection
  2. Definitions of functional and correlational connection

    Functional connection: a type of relationship between two characteristics in which each value of one corresponds to a strictly defined value of the other (the area of a circle depends on its radius, etc.). Functional connections are characteristic of physical and mathematical processes.

    Correlation: a relationship in which each specific value of one characteristic corresponds to several values of the other, interrelated, characteristic (the relationship between a person's height and weight; between body temperature and pulse rate, etc.). Correlation is typical of medical and biological processes.

  3. The practical significance of establishing a correlation. Identifying cause and effect between factor and outcome characteristics (when assessing physical development; when determining the connection between working conditions, living conditions and health status; when determining the dependence of illness frequency on age, length of service, the presence of occupational hazards, etc.)

    Dependence of parallel changes in several characteristics on some third value. For example, under the influence of high temperature in the workshop, changes in blood pressure, blood viscosity, pulse rate, etc. occur.

  4. The quantity characterizing the direction and strength of the relationship between characteristics: the correlation coefficient, which in a single number gives an idea of the direction and strength of the connection between characteristics (phenomena); it ranges from 0 to ±1 in absolute value
  5. Methods of presenting correlations
    • graph (scatter plot)
    • correlation coefficient
  6. Direction of correlation
    • straight
    • reverse
  7. Strength of correlation
    • strong: ±0.7 to ±1
    • average: ±0.3 to ±0.699
    • weak: 0 to ±0.299
  8. Methods for determining the correlation coefficient and formulas
    • method of squares (Pearson method)
    • rank method (Spearman method)
  9. Methodological requirements for using the correlation coefficient
    • measuring the relationship is only possible in qualitatively homogeneous populations (for example, measuring the relationship between height and weight in populations homogeneous by gender and age)
    • calculation can be made using absolute or derived values
    • to calculate the correlation coefficient, ungrouped variation series are used (this requirement applies only when calculating the correlation coefficient using the method of squares)
    • number of observations at least 30
  10. Recommendations for using the rank correlation method (Spearman's method)
    • when there is no need to accurately establish the strength of the connection, but approximate data is sufficient
    • when characteristics are represented not only by quantitative, but also by attributive values
    • when the distribution series of characteristics have open options (for example, work experience up to 1 year, etc.)
  11. Recommendations for using the method of squares (Pearson's method)
    • when an accurate determination of the strength of connection between characteristics is required
    • when signs have only quantitative expression
  12. Methodology and procedure for calculating the correlation coefficient

    1) Method of squares

    2) Rank method

  13. Scheme for assessing the correlation relationship using the correlation coefficient
  14. Calculation of correlation coefficient error
  15. Estimation of the reliability of the correlation coefficient obtained by the rank correlation method and the method of squares

    Method 1
    Reliability is determined by the formula:

    t = r_xy / m_r,  where m_r = √((1 - r_xy²) / (n - 2))

    The t-criterion is evaluated against a table of t values, taking into account the number of degrees of freedom (n - 2), where n is the number of paired values. The computed t must be equal to or greater than the tabulated value corresponding to a probability p ≥ 99%.

    Method 2
    Reliability is assessed using a special table of standard correlation coefficients. In this case, a correlation coefficient is considered reliable when, with a certain number of degrees of freedom (n - 2), it is equal to or more than the tabular one, corresponding to the degree of error-free prediction p ≥95%.

Example: using the method of squares (Pearson's method)

Exercise: calculate the correlation coefficient and determine the direction and strength of the relationship between the amount of calcium in water and water hardness, given the data below (Table 1). Assess the reliability of the relationship and draw a conclusion.

Table 1

Justification for the choice of method. The method of squares (Pearson) was chosen because each of the characteristics (water hardness and amount of calcium) has a numerical expression and there are no open-ended values.

Solution.
The sequence of calculations is described in the text; the results are presented in the table. Having constructed the series of paired comparable characteristics, we denote them by x (water hardness in degrees) and y (amount of calcium in water, in mg/l).

Water hardness,   Calcium in water,   d_x    d_y     d_x·d_y   d_x²   d_y²
x (degrees)       y (mg/l)
 4                 28                 -16    -114    1824      256    12996
 8                 56                 -12     -86    1032      144     7396
11                 77                  -9     -66     594       81     4356
27                191                  +7     +48     336       49     2304
34                241                 +14     +98    1372      196     9604
37                262                 +16    +120    1920      256    14400

M_x = Σx/n = 120/6 = 20    M_y = Σy/n = 852/6 = 142    Σd_x·d_y = 7078    Σd_x² = 982    Σd_y² = 51056
  1. Determine the average values M_x of the series "x" and M_y of the series "y" using the formulas:
    M_x = Σx/n (column 1) and M_y = Σy/n (column 2)
  2. Find the deviation (d_x and d_y) of each value from the calculated average in series "x" and series "y":
    d_x = x - M_x (column 3) and d_y = y - M_y (column 4)
  3. Find the products of the deviations d_x·d_y and sum them: Σd_x·d_y = 7078 (column 5)
  4. Square each deviation d_x and d_y and sum the squares along series "x" and series "y": Σd_x² = 982 (column 6) and Σd_y² = 51056 (column 7)
  5. Determine the product Σd_x² · Σd_y² and extract its square root:
    √(982 · 51056) = √50136992 ≈ 7080.7
  6. Substitute the resulting values Σ(d_x·d_y) and √(Σd_x² · Σd_y²) into the formula for the correlation coefficient:
    r_xy = Σ(d_x·d_y) / √(Σd_x² · Σd_y²) = 7078 / 7080.7 ≈ +0.99
  7. Determine the reliability of the correlation coefficient:
    1st method. Find the error of the correlation coefficient (m_r) and the t-criterion using the formulas:
    m_r = √((1 - r_xy²) / (n - 2)) = √((1 - 0.98) / 4) ≈ 0.07
    t = r_xy / m_r = 0.99 / 0.07 = 14.1

    Criterion t = 14.1 corresponds to an error-free forecast probability p > 99.9%.

    2nd method. The reliability of the correlation coefficient is assessed using the table "Standard correlation coefficients" (see Appendix 1). With (n - 2) = 6 - 2 = 4 degrees of freedom, our calculated correlation coefficient r_xy = +0.99 is greater than the tabulated value (r_table = 0.917 at p = 99%).

    Conclusion. The more calcium there is in the water, the harder the water (the connection is direct, strong and reliable: r_xy = +0.99, p > 99.9%).
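The worked example can be cross-checked programmatically (a sketch; the small discrepancy with the hand result comes from the table rounding the means to M_x = 20 and M_y = 142):

```python
import math

# Raw data from the worked example: water hardness (degrees), calcium (mg/l)
x = [4, 8, 11, 27, 34, 37]
y = [28, 56, 77, 191, 241, 262]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                sum((b - my) ** 2 for b in y))
r = num / den

# Prints 1.0 here; the hand calculation's extra rounding gives +0.99
print(round(r, 2))

# Reliability (1st method): error m_r and the t-criterion
m_r = math.sqrt((1 - r ** 2) / (n - 2))
t = r / m_r
print(t > 2.776)  # exceeds the 5% critical t for 4 degrees of freedom
```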

    Example: using the rank method (Spearman's method)

    Exercise: Using the rank method, establish the direction and strength of the relationship between years of work experience and the frequency of injuries if the following data are obtained:

    Justification for choosing the method: only the rank correlation method can be used here, because the first series, "work experience in years", contains open-ended values (experience up to 1 year and 7 or more years), which rules out the more precise method of squares for establishing a connection between the compared characteristics.

    Solution. The sequence of calculations is presented in the text; the results are shown in Table 2.

    Table 2

    Work experience   Number of   Ranks       Rank difference   Squared rank
    (years)           injuries    X     Y     d (x - y)         difference d²
    Up to 1 year      24          1     5     -4                16
    1-2               16          2     4     -2                 4
    3-4               12          3     2.5   +0.5               0.25
    5-6               12          4     2.5   +1.5               2.25
    7 or more          6          5     1     +4                16
                                                         Σd² = 38.5
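    Table 2 can be reproduced with a short sketch (the helper ranks() is illustrative): ranks are assigned in ascending order, tied values receive the average of their positions, and Spearman's formula ρ = 1 - 6Σd² / (n(n² - 1)) is applied.

```python
def ranks(values):
    # Ascending ranks; tied values share the average of their positions
    s = sorted(values)
    return [sum(i + 1 for i, v in enumerate(s) if v == val) / s.count(val)
            for val in values]

# Experience categories are already ordered, so their ranks are 1..5;
# injuries are 24, 16, 12, 12, 6 (from Table 2)
rx = [1, 2, 3, 4, 5]
ry = ranks([24, 16, 12, 12, 6])   # [5, 4, 2.5, 2.5, 1]

n = len(rx)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # 38.5, as in Table 2
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rho, 3))  # -0.925
```

    The negative coefficient indicates an inverse relationship: injury frequency falls as work experience grows.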

    Standard correlation coefficients that are considered reliable (according to L.S. Kaminsky)

    Number of degrees      Probability level p (%)
    of freedom (n - 2)     95%      98%      99%
    1                      0.997    0.999    0.999
    2                      0.950    0.980    0.990
    3                      0.878    0.934    0.959
    4                      0.811    0.882    0.917
    5                      0.754    0.833    0.874
    6                      0.707    0.789    0.834
    7                      0.666    0.750    0.798
    8                      0.632    0.716    0.765
    9                      0.602    0.685    0.735
    10                     0.576    0.658    0.708
    11                     0.553    0.634    0.684
    12                     0.532    0.612    0.661
    13                     0.514    0.592    0.641
    14                     0.497    0.574    0.623
    15                     0.482    0.558    0.606
    16                     0.468    0.542    0.590
    17                     0.456    0.528    0.575
    18                     0.444    0.516    0.561
    19                     0.433    0.503    0.549
    20                     0.423    0.492    0.537
    25                     0.381    0.445    0.487
    30                     0.349    0.409    0.449


Task:
There is a paired sample of 26 value pairs (x_k, y_k):

k     1         2         3         4         5         6         7         8         9         10
x_k   25.20000  26.40000  26.00000  25.80000  24.90000  25.70000  25.70000  25.70000  26.10000  25.80000
y_k   30.80000  29.40000  30.20000  30.50000  31.40000  30.30000  30.40000  30.50000  29.90000  30.40000

k     11        12        13        14        15        16        17        18        19        20
x_k   25.90000  26.20000  25.60000  25.40000  26.60000  26.20000  26.00000  22.10000  25.90000  25.80000
y_k   30.30000  30.50000  30.60000  31.00000  29.60000  30.40000  30.70000  31.60000  30.50000  30.60000

k     21        22        23        24        25        26
x_k   25.90000  26.30000  26.10000  26.00000  26.40000  25.80000
y_k   30.70000  30.10000  30.60000  30.50000  30.70000  30.80000

Required to calculate/plot:
- correlation coefficient;
- test the hypothesis of the dependence of random variables X and Y, at a significance level of α = 0.05;
- linear regression equation coefficients;
- scatter diagram (correlation field) and regression line graph;

SOLUTION:

1. Calculate the correlation coefficient.

The correlation coefficient is an indicator of the mutual probabilistic influence of two random variables. The correlation coefficient R can take values from -1 to +1. An absolute value close to 1 is evidence of a strong connection between the quantities; a value close to 0 indicates a weak connection or none at all. If the absolute value of R equals one, the connection is functional: one quantity can be expressed through the other by a mathematical function.


The correlation coefficient can be calculated using the following formulas:

    R_xy = cov(X, Y) / (σ_x σ_y)   (1.1), where:

    cov(X, Y) = (1/n) Σ_{k=1..n} (x_k - M_x)(y_k - M_y)   (1.2)

    σ_x² = (1/n) Σ_{k=1..n} (x_k - M_x)²,   σ_y² = (1/n) Σ_{k=1..n} (y_k - M_y)²   (1.3)

    M_x = (1/n) Σ_{k=1..n} x_k,   M_y = (1/n) Σ_{k=1..n} y_k

or by the formula

    R_xy = (M_xy - M_x·M_y) / (S_x·S_y)   (1.4), where:

    M_x = (1/n) Σ_{k=1..n} x_k,   M_y = (1/n) Σ_{k=1..n} y_k,   M_xy = (1/n) Σ_{k=1..n} x_k·y_k   (1.5)

    S_x² = (1/n) Σ_{k=1..n} x_k² - M_x²,   S_y² = (1/n) Σ_{k=1..n} y_k² - M_y²   (1.6)

In practice, formula (1.4) is used more often to calculate the correlation coefficient because it requires less computation. However, if the covariance cov(X, Y) has already been calculated, it is more convenient to use formula (1.1), since besides the covariance itself the results of intermediate calculations can be reused.

1.1. Let's calculate the correlation coefficient using formula (1.4). To do this, we calculate the values x_k², y_k² and x_k·y_k and enter them in Table 1.

Table 1


k     x_k     y_k     x_k²         y_k²         x_k·y_k
(1)   (2)     (3)     (4)          (5)          (6)
1 25.2 30.8 635.04000 948.64000 776.16000
2 26.4 29.4 696.96000 864.36000 776.16000
3 26.0 30.2 676.00000 912.04000 785.20000
4 25.8 30.5 665.64000 930.25000 786.90000
5 24.9 31.4 620.01000 985.96000 781.86000
6 25.7 30.3 660.49000 918.09000 778.71000
7 25.7 30.4 660.49000 924.16000 781.28000
8 25.7 30.5 660.49000 930.25000 783.85000
9 26.1 29.9 681.21000 894.01000 780.39000
10 25.8 30.4 665.64000 924.16000 784.32000
11 25.9 30.3 670.81000 918.09000 784.77000
12 26.2 30.5 686.44000 930.25000 799.10000
13 25.6 30.6 655.36000 936.36000 783.36000
14 25.4 31 645.16000 961.00000 787.40000
15 26.6 29.6 707.56000 876.16000 787.36000
16 26.2 30.4 686.44000 924.16000 796.48000
17 26 30.7 676.00000 942.49000 798.20000
18 22.1 31.6 488.41000 998.56000 698.36000
19 25.9 30.5 670.81000 930.25000 789.95000
20 25.8 30.6 665.64000 936.36000 789.48000
21 25.9 30.7 670.81000 942.49000 795.13000
22 26.3 30.1 691.69000 906.01000 791.63000
23 26.1 30.6 681.21000 936.36000 798.66000
24 26 30.5 676.00000 930.25000 793.00000
25 26.4 30.7 696.96000 942.49000 810.48000
26 25.8 30.8 665.64000 948.64000 794.64000


1.2. Let's calculate M_x using formula (1.5).

1.2.1. Let's add all the elements x_k sequentially:

x_1 + x_2 + … + x_26 = 25.20000 + 26.40000 + ... + 25.80000 = 669.500000

1.2.2. Divide the resulting sum by the number of sample elements:

669.50000 / 26 = 25.75000

M_x = 25.750000

1.3. Let us calculate M_y in a similar way.

1.3.1. Let's add all the elements y_k sequentially:

y_1 + y_2 + … + y_26 = 30.80000 + 29.40000 + ... + 30.80000 = 793.000000

1.3.2. Divide the resulting sum by the number of sample elements:

793.00000 / 26 = 30.50000

M_y = 30.500000

1.4. We calculate M_xy in a similar way.

1.4.1. Let's add sequentially all the elements of the 6th column of Table 1:

776.16000 + 776.16000 + ... + 794.64000 = 20412.830000

1.4.2. Divide the resulting sum by the number of elements:

20412.83000 / 26 = 785.10885

M_xy = 785.108846

1.5. Let's calculate the value of S_x² using formula (1.6).

1.5.1. Let's add sequentially all the elements of the 4th column of Table 1:

635.04000 + 696.96000 + ... + 665.64000 = 17256.910000

1.5.2. Divide the resulting sum by the number of elements:

17256.91000 / 26 = 663.72731

1.5.3. Subtract the square of M_x from the last number to obtain the value of S_x²:

S_x² = 663.72731 - 25.75000² = 663.72731 - 663.06250 = 0.66481

1.6. Let's calculate the value of S_y² using formula (1.6).

1.6.1. Let's add sequentially all the elements of the 5th column of Table 1:

948.64000 + 864.36000 + ... + 948.64000 = 24191.840000

1.6.2. Divide the resulting sum by the number of elements:

24191.84000 / 26 = 930.45538

1.6.3. Subtract the square of M_y from the last number to obtain the value of S_y²:

S_y² = 930.45538 - 30.50000² = 930.45538 - 930.25000 = 0.20538

1.7. Let's calculate the product S_x² · S_y².

S_x² · S_y² = 0.66481 × 0.20538 = 0.136541

1.8. Let's take the square root of the last number to obtain the value S_x·S_y.

S_x·S_y = 0.36951

1.9. Let's calculate the value of the correlation coefficient using formula (1.4).

R = (785.10885 - 25.75000 × 30.50000) / 0.36951 = (785.10885 - 785.37500) / 0.36951 = -0.72028

ANSWER: R_xy = -0.720279
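Section 1 can be reproduced end to end with a short sketch following formulas (1.4)-(1.6):

```python
import math

# The 26 paired values from the task statement
x = [25.2, 26.4, 26.0, 25.8, 24.9, 25.7, 25.7, 25.7, 26.1, 25.8,
     25.9, 26.2, 25.6, 25.4, 26.6, 26.2, 26.0, 22.1, 25.9, 25.8,
     25.9, 26.3, 26.1, 26.0, 26.4, 25.8]
y = [30.8, 29.4, 30.2, 30.5, 31.4, 30.3, 30.4, 30.5, 29.9, 30.4,
     30.3, 30.5, 30.6, 31.0, 29.6, 30.4, 30.7, 31.6, 30.5, 30.6,
     30.7, 30.1, 30.6, 30.5, 30.7, 30.8]

n = len(x)
Mx = sum(x) / n                                # 25.75
My = sum(y) / n                                # 30.50
Mxy = sum(a * b for a, b in zip(x, y)) / n     # formula (1.5)
Sx2 = sum(a * a for a in x) / n - Mx ** 2      # formula (1.6)
Sy2 = sum(b * b for b in y) / n - My ** 2
R = (Mxy - Mx * My) / math.sqrt(Sx2 * Sy2)     # formula (1.4)
print(round(R, 4))  # agrees with the hand result R ≈ -0.7203
```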

2. We check the significance of the correlation coefficient (we check the hypothesis of dependence).

Because the correlation coefficient estimate is calculated on a finite sample and may therefore deviate from its population value, it is necessary to test its significance. The check is performed using the t-criterion:

t = R_xy · √(n - 2) / √(1 - R_xy²)   (2.1)

The random variable t follows Student's t-distribution; using the t-distribution table, find the critical value of the criterion (t_cr.α) at the given significance level α. If the absolute value of t calculated by formula (2.1) turns out to be less than t_cr.α, the dependence between the random variables X and Y is not confirmed. Otherwise, the experimental data do not contradict the hypothesis that the random variables are dependent.


2.1. Let us calculate the value of the t-criterion using formula (2.1):

t = -0.72028 · √(26 - 2) / √(1 - (-0.72028)²) = -5.08680

2.2. Using the t-distribution table, we determine the critical value t_cr.α.

The desired value of t_cr.α is located at the intersection of the row corresponding to the number of degrees of freedom and the column corresponding to the given significance level α.
In our case the number of degrees of freedom is n - 2 = 26 - 2 = 24 and α = 0.05, which corresponds to the critical value t_cr.α = 2.064 (see Table 2).

Table 2. t-distribution

Number of degrees
of freedom (n - 2)   α = 0.1   α = 0.05   α = 0.02   α = 0.01   α = 0.002   α = 0.001
1 6.314 12.706 31.821 63.657 318.31 636.62
2 2.920 4.303 6.965 9.925 22.327 31.598
3 2.353 3.182 4.541 5.841 10.214 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.895 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.767
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
40 1.684 2.021 2.423 2.704 3.307 3.551
60 1.671 2.000 2.390 2.660 3.232 3.460
120 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.645 1.960 2.326 2.576 3.090 3.291


2.3. Let's compare the absolute value of the t-criterion with t cr.α

The absolute value of the t-criterion is not less than the critical value: |t| = 5.08680 > t cr.α = 2.064. Therefore, with probability 1 - α = 0.95, the experimental data do not contradict the hypothesis that the random variables X and Y are dependent.
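The significance check of section 2 can be sketched numerically; r and n are taken from step 1, and the critical value 2.064 from Table 2:

```python
import math

r = -0.72028   # correlation coefficient from step 1.9
n = 26         # sample size
t_cr = 2.064   # critical value for alpha = 0.05, df = n - 2 = 24 (Table 2)

# Formula (2.1): t = r * sqrt(n - 2) / sqrt(1 - r^2)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
significant = abs(t) > t_cr   # True: the hypothesis of dependence is not rejected
```

With r = -0.72028 this gives t ≈ -5.0868, in agreement with step 2.1.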

3. Calculate the coefficients of the linear regression equation.

A linear regression equation is the equation of a straight line that approximates (approximately describes) the relationship between the random variables X and Y. If we assume that X is the independent variable and Y depends on X, then the regression equation is written as follows:


Y = a + b·X   (3.1), where:

b = R x,y · σ y / σ x = R x,y · S y / S x   (3.2)

a = M y - b · M x   (3.3)

The coefficient b calculated by formula (3.2) is called the linear regression coefficient. In some sources, a is called the constant regression coefficient and b the coefficient on the variable.

Errors in predicting Y for a given value of X are calculated using the formulas:

σ y/x = σ y · √(1 - R² x,y)   (3.4)

δ y/x = (σ y/x / M y) · 100%   (3.5)

The quantity σ y/x (formula (3.4)) is also called the residual standard deviation; it characterizes the departure of Y from the regression line described by equation (3.1) at a fixed (given) value of X.

3.1. Let's calculate S y 2 / S x 2 = 0.20538 / 0.66481 = 0.30894.

3.2. Let's take the square root of the last number and get: S y / S x = 0.55582.

3.3 Let's calculate the coefficient b according to formula (3.2)

b = -0.72028 · 0.55582 = -0.40035

3.4 Let's calculate the coefficient a according to formula (3.3)

a = 30.50000 - (-0.40035 · 25.75000) = 40.80894
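Steps 3.1-3.4 can be sketched as follows, reusing the values computed earlier in the text:

```python
r = -0.72028                    # correlation coefficient (step 1.9)
Sy_over_Sx = 0.55582            # S_y / S_x (step 3.2)
M_x, M_y = 25.75000, 30.50000   # sample means of X and Y

b = r * Sy_over_Sx   # formula (3.2): slope of the regression line
a = M_y - b * M_x    # formula (3.3): intercept
```

This reproduces b ≈ -0.40035 and a ≈ 40.809, so the fitted line is Y = 40.809 - 0.40035·X.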

3.5 Let’s estimate the errors of the regression equation.

3.5.1 Taking the square root of S y 2, we get: S y = √0.20538 = 0.45319.

3.5.2 Let's calculate the residual standard deviation using formula (3.4):

σ y/x = 0.45319 · √(1 - (-0.72028)²) = 0.31437

3.5.3 Let's calculate the relative error using formula (3.5):

δ y/x = (0.31437 / 30.50000) · 100% = 1.03073%
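A minimal numeric sketch of the error estimates, assuming the residual standard deviation σ y/x = S y · √(1 − R²), which reproduces the figures above:

```python
import math

r = -0.72028     # correlation coefficient
S_y2 = 0.20538   # sample variance of Y
M_y = 30.50000   # mean of Y

S_y = math.sqrt(S_y2)                   # standard deviation of Y
sigma_yx = S_y * math.sqrt(1 - r ** 2)  # residual standard deviation
delta = sigma_yx / M_y * 100            # relative prediction error, percent
```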

4. We build a scatter diagram (correlation field) and a regression line graph.

A scatterplot is a graphical representation of corresponding pairs (x k, y k) as points on a plane, in rectangular coordinates with the X and Y axes. The correlation field is one of the graphical representations of a related (paired) sample. The regression line graph is also plotted in the same coordinate system. Scales and starting points on the axes should be chosen carefully to ensure that the diagram is as clear as possible.

4.1. We find the minimum and maximum elements of the sample X: the 18th and 15th elements respectively, x min = 22.10000 and x max = 26.60000.

4.2. We find the minimum and maximum elements of the sample Y: the 2nd and 18th elements respectively, y min = 29.40000 and y max = 31.60000.

4.3. On the x-axis, select a starting point slightly to the left of the point x 18 = 22.10000, and such a scale that the point x 15 = 26.60000 fits on the axis and the remaining points are clearly visible.

4.4. On the ordinate axis, select a starting point slightly below the point y 2 = 29.40000, and such a scale that the point y 18 = 31.60000 fits on the axis and the remaining points are clearly distinguishable.

4.5. We place the x k values on the abscissa axis and the y k values on the ordinate axis.

4.6. We plot the points (x 1, y 1), (x 2, y 2),…, (x 26, y 26) on the coordinate plane. We get the scatter diagram (correlation field) shown in the figure below.

4.7. Let's draw a regression line.

To do this, we will find two different points with coordinates (x r1, y r1) and (x r2, y r2) satisfying equation (3.6), plot them on the coordinate plane and draw a straight line through them. As the abscissa of the first point, we take the value x min = 22.10000. Substituting the value x min into equation (3.6), we obtain the ordinate of the first point. Thus, we have a point with coordinates (22.10000, 31.96127). In a similar way, we obtain the coordinates of the second point, putting the value x max = 26.60000 as the abscissa. The second point will be: (26.60000, 30.15970).
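The two endpoints of the regression line can be computed directly. This sketch assumes that equation (3.6), which is not reproduced in this excerpt, is the fitted line Y = a + b·X with the coefficients from steps 3.3-3.4:

```python
a, b = 40.80894, -0.40035   # intercept and slope from steps 3.3-3.4

def predict(x):
    """Ordinate of the point on the regression line for a given abscissa."""
    return a + b * x

p1 = (22.1, predict(22.1))   # first endpoint, approx (22.1, 31.961)
p2 = (26.6, predict(26.6))   # second endpoint, approx (26.6, 30.160)
```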

The regression line is shown in the figure below in red

Please note that the regression line always passes through the point of the average values ​​of X and Y, i.e. with coordinates (M x , M y).

06.06.2018, Igor

Psychology and Society

Everything in the world is interconnected. Each person, at the level of intuition, tries to find relationships between phenomena in order to be able to influence and control them. The concept that reflects this relationship is called correlation. What does it mean in simple words?


Concept of correlation

Correlation (from the Latin correlatio, "interrelation") is a mathematical term meaning a measure of the statistical, probabilistic dependence between random quantities (variables).



Example: Let's take two types of relationships:

  1. The first is a pen in a person's hand. The pen goes in whatever direction the hand moves. If the hand is at rest, the pen will not write. If the person presses a little harder, the mark on the paper will be darker. This type of relationship reflects a strict dependence and is not correlational; it is functional.
  2. The second type is the relationship between a person's level of education and how much literature they read. It is not known in advance who reads more: people with or without higher education. This connection is random, or stochastic; it is studied by statistics, the science that deals exclusively with mass phenomena. If a statistical calculation proves a correlation between level of education and reading, this makes it possible to make forecasts and predict the probabilistic occurrence of events. In this example it can be argued, with a high degree of probability, that people with higher education, being more educated, read more books. But since the connection between these parameters is not functional, we may be mistaken. The probability of such an error can always be calculated; it is clearly small and is called the level of statistical significance (p).

Examples of relationships between natural phenomena are: the food chain in nature, the human body, which consists of organ systems that are interconnected and function as a single whole.

Every day we encounter correlations in everyday life: between the weather and a good mood, the correct formulation of goals and their achievement, a positive attitude and luck, a feeling of happiness and financial well-being. But we are looking for connections, relying not on mathematical calculations, but on myths, intuition, superstitions, and idle speculation. These phenomena are very difficult to translate into mathematical language, express in numbers, and measure. It’s another matter when we analyze phenomena that can be calculated and presented in the form of numbers. In this case, we can define correlation using the correlation coefficient (r), which reflects the strength, degree, closeness and direction of the correlation between random variables.

A strong correlation between random variables is evidence of some statistical connection between these particular phenomena, but that connection cannot be carried over to the same phenomena in a different situation. Researchers, having obtained a significant correlation between two variables in their calculations, and relying on the simplicity of correlation analysis, often make false intuitive assumptions about the existence of a cause-and-effect relationship between the characteristics, forgetting that the correlation coefficient is probabilistic in nature.

Example: the number of people injured during icy conditions and the number of road accidents. These quantities correlate with each other, although they are not directly interconnected; they merely share a common cause of these random events: black ice. If the analysis does not reveal a correlation between phenomena, this is not yet evidence of the absence of a dependence between them; the dependence may be complex and nonlinear, and thus not revealed by correlation calculations.




The first to introduce the concept of correlation into scientific use was the French paleontologist Georges Cuvier. At the turn of the 19th century, he formulated the law of correlation of the parts and organs of living organisms, thanks to which it became possible to reconstruct the appearance of an entire fossil animal from found parts of the body (remains). In statistics, the term correlation was first used in 1886 by the English scientist Francis Galton. Galton could not derive the exact formula for calculating the correlation coefficient, but his student, the famous mathematician and biologist Karl Pearson, did.

Types of correlation

By significance: highly significant, significant, and insignificant.

  • Highly significant – r corresponds to the level of statistical significance p ≤ 0.01
  • Significant – r corresponds to p ≤ 0.05
  • Insignificant – r does not reach statistical significance (p > 0.1)

Negative (a decrease in the value of one variable accompanies an increase in the level of the other: the more phobias a person has, the less likely he is to occupy a leadership position) and positive (an increase in one variable accompanies an increase in the level of the other: the more nervous you are, the more likely you are to get sick). If there is no connection between the variables, such a correlation is called zero.

Linear (when one value increases or decreases, the second increases or decreases correspondingly) and nonlinear (when, as one value changes, the nature of the change in the second cannot be described by a linear relationship; other mathematical laws are applied, such as polynomial or hyperbolic relationships).

By strength – weak, moderate, or strong, according to the absolute value of the coefficient.

Depending on which scale the variables under study belong to, different types of correlation coefficients are calculated:

  1. The Pearson correlation coefficient (the pairwise linear correlation coefficient, or product-moment correlation) is calculated for variables measured on interval or ratio scales.
  2. Spearman or Kendall rank correlation coefficient - when at least one of the quantities has an ordinal scale or is not normally distributed.
  3. Point biserial correlation coefficient (Fechner sign correlation coefficient) – if one of the two quantities is dichotomous.
  4. Four-field correlation coefficient (multiple rank correlation (concordance) coefficient) – if both variables are dichotomous.

The Pearson coefficient is a parametric correlation measure; all the others are non-parametric.

The correlation coefficient value ranges from -1 to +1. With a complete positive correlation, r = +1, with a complete negative correlation, r = -1.

Formula and calculation
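The formula itself did not survive in this copy of the text, but the standard Pearson product-moment formula is r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). A minimal sketch of the calculation (the data are made up for illustration):

```python
def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n   # mean of x
    my = sum(ys) / n   # mean of y
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

xs = [1, 2, 3, 4, 5]
r_pos = pearson(xs, [2 * x + 1 for x in xs])    # complete positive correlation: 1.0
r_neg = pearson(xs, [10 - 3 * x for x in xs])   # complete negative correlation: -1.0
```

With a perfectly linear relationship the coefficient reaches its extreme values +1 and -1, matching the range described above.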





Examples

It is necessary to determine the relationship between two variables: the level of intellectual development (according to test results) and the number of times students are late per month (according to entries in the class register) among schoolchildren.

The initial data are presented in a table (given as an image in the original; only the totals for the IQ column (x) survive: sum 1122, average 112.2).


To give a correct interpretation of the obtained indicator, it is necessary to analyze the sign of the correlation coefficient (+ or -) and its absolute value (modulus).

In accordance with the table classifying correlation coefficients by strength, we conclude that r xy = -0.827 is a strong negative correlation. Thus, the number of times schoolchildren are late depends strongly on their level of intellectual development: students with a high IQ are late for classes less often than students with a low IQ.



The correlation coefficient can be used both by scientists, to confirm or refute a hypothesis about the dependence of two quantities or phenomena and to measure its strength and significance, and by students, to conduct empirical and statistical research on various subjects. It must be remembered that this indicator is not an ideal tool: it measures only the strength of a linear relationship and is always a probabilistic value with a certain error.

Correlation analysis is used in the following areas:

  • economic science;
  • astrophysics;
  • social sciences (sociology, psychology, pedagogy);
  • agrochemistry;
  • metallurgy;
  • industry (for quality control);
  • hydrobiology;
  • biometrics, etc.

Reasons for the popularity of the correlation analysis method:

  1. Calculating correlation coefficients is relatively simple and does not require special mathematical education.
  2. Allows you to calculate the relationships between mass random variables, which are the subject of analysis in statistical science. In this regard, this method has become widespread in the field of statistical research.

I hope that now you will be able to distinguish a functional relationship from a correlational one, and that when you hear about correlation on television or read about it in the press, you will understand what kind of interdependence between two phenomena is meant.