Kappa statistics are used to assess the agreement between two or more evaluators when the measurement scale is categorical. In this brief summary, we discuss and interpret the main features of kappa statistics, the influence of prevalence on kappa, and their usefulness in clinical research. We also introduce weighted kappa for ordinal ratings, and the intraclass correlation for assessing agreement when the data are measured on a continuous scale. One more thing to keep in mind: most of the time, kappa works very well as a measure of agreement. However, there is an interesting situation in which the percentage of agreement is very high but the kappa statistic is very low. This is called the kappa paradox. Most statistical software can calculate kappa. For simple data sets (i.e., two evaluators, two categories) the calculation of kappa by hand is quite simple; for larger data sets, you'll probably want to use software such as SPSS. The accuracy of the observers affects the maximum attainable kappa value: as the simulation results show, kappa values for 12 codes appear to reach asymptotes of about 0.60, 0.70, 0.80 and 0.90 at increasing levels of observer accuracy. Calculation of percentage agreement (fictitious data).
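To make the by-hand calculation concrete, here is a minimal Python sketch that computes Cohen's kappa for two evaluators from two lists of categorical ratings; the ratings and the helper name cohens_kappa are hypothetical and used only for illustration.

```python
# Minimal sketch: Cohen's kappa for two raters, computed from
# hypothetical yes/no ratings (illustrative data only).
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed proportional agreement (po)
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement (pe) from each rater's marginal proportions
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    pe = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    return (po - pe) / (1 - pe)

# Hypothetical example: two raters scoring 10 items
a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]
print(round(cohens_kappa(a, b), 3))  # po = 0.8, pe = 0.52, kappa ~ 0.583
```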
A similar statistic, called pi, was proposed by Scott (1955). Cohen's kappa and Scott's pi differ in how pe, the expected chance agreement, is calculated. Why can't we use these rules of thumb as clear cutoffs? As with everything in statistics, the decision depends on the study and its goal. The procedure is simple when the values are only zero and one and the number of data collectors is two. If there are more data collectors, the procedure is a little more complex (Table 2). However, as long as the scores are limited to only two values, the calculation remains easy: the researcher simply calculates the percentage agreement for each row and averages across rows. Another advantage of the matrix is that it lets the researcher see whether errors are random, and therefore fairly evenly distributed among all evaluators and variables, or whether a particular data collector frequently records values that differ from those of the other data collectors. Table 2, which shows an overall interrater reliability of 90%, indicates that no data collector had an excessive number of outliers (values that were not consistent with the majority of the evaluators' ratings).
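As a rough illustration of the row-by-row procedure described above, the following sketch computes a per-variable (per-row) percentage agreement across several data collectors and then averages across rows. The scores are hypothetical, and the convention used here (the share of collectors who gave the modal score for that row) is an assumption for illustration.

```python
# Sketch of the row-by-row percentage-agreement procedure, assuming
# hypothetical 0/1 scores. Each row is one variable; each column is
# one data collector.
scores = [
    [1, 1, 1, 1, 1],   # variable 1: all five collectors agree
    [0, 0, 1, 0, 0],   # variable 2: one collector disagrees
    [1, 0, 1, 1, 0],   # variable 3: split decision
]

def row_agreement(row):
    # Share of collectors who gave the most common (modal) score.
    return max(row.count(v) for v in set(row)) / len(row)

per_row = [row_agreement(row) for row in scores]
overall = sum(per_row) / len(per_row)
print(per_row)            # [1.0, 0.8, 0.6]
print(round(overall, 2))  # overall reliability across rows: 0.8
```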
Another advantage of this technique is that it allows the researcher to identify variables that may be problematic. Note that Table 2 shows the evaluators achieved only 60% agreement on variable 10. This variable may warrant review to identify the cause of such weak agreement in its assessment. The standard error of kappa for the data in Figure 3 can be computed from po = 0.94, pe = 0.57 and N = 222. Factors that affect kappa values include observer accuracy and the number of codes, as well as the prevalence of individual codes and observer bias toward particular codes. Kappa can equal 1 only if the observers distribute their codes identically. There is no value of kappa that can be considered universally acceptable; it depends on the accuracy of the observers and the number of codes. Cohen's kappa coefficient (κ) is a statistic used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. [1] It is generally considered a more robust measure than simple percentage agreement, since κ takes into account the possibility that agreement occurs by chance. There is controversy around Cohen's kappa because of the difficulty of interpreting indices of agreement. Some researchers have suggested that it is conceptually simpler to evaluate disagreement between items.
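As a hedged illustration, the sketch below applies one common large-sample approximation for the standard error of kappa, SE = sqrt(po(1 − po) / (N(1 − pe)²)), to the Figure 3 values quoted above; other, more exact formulas exist, so treat the result as approximate.

```python
# Sketch: large-sample approximation for the standard error of kappa,
# SE = sqrt(po * (1 - po) / (N * (1 - pe)**2)),
# applied to the Figure 3 values quoted above (po = 0.94, pe = 0.57, N = 222).
import math

def kappa_standard_error(po, pe, n):
    return math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))

print(round(kappa_standard_error(po=0.94, pe=0.57, n=222), 3))  # about 0.037
```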
[2] For more information, see the Limitations section. Step 1: Calculate po (the observed proportional agreement): 20 images were rated Yes by both, and 15 images were rated No by both. Thus, po = number of agreements / total = (20 + 15) / 50 = 0.70. Weighted kappa allows disagreements to be weighted differently[21] and is particularly useful when the codes are ordered. [8]:66 Three matrices are involved: the matrix of observed scores, the matrix of expected scores based on chance agreement, and the matrix of weights. The cells of the weight matrix on the diagonal (top left to bottom right) represent agreement and therefore contain zeros. Cells off the diagonal contain weights indicating the severity of that disagreement. Often, cells one step off the diagonal are weighted 1, those two steps off 2, and so on. We find that the second case shows greater similarity between A and B than the first. Indeed, although the percentage agreement is the same, the percentage agreement that would occur “randomly” is significantly higher in the first case (0.54 compared to 0.46).
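The three-matrix construction for weighted kappa can be sketched as follows, using linear disagreement weights and hypothetical ordinal ratings on a 1 to 3 scale; the data and the function name weighted_kappa are illustrative only.

```python
# Minimal sketch of weighted kappa with linear weights for ordinal codes.
import numpy as np

def weighted_kappa(rater_a, rater_b, categories):
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Matrix of observed proportions
    observed = np.zeros((k, k))
    for a, b in zip(rater_a, rater_b):
        observed[idx[a], idx[b]] += 1 / n

    # Matrix of expected proportions under chance, from the marginals
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Weight matrix: 0 on the diagonal, 1 one step off, 2 two steps off, ...
    weights = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))

    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical ordinal ratings from two raters
a = [1, 2, 3, 2, 1, 3, 2, 2, 1, 3]
b = [1, 2, 3, 3, 1, 2, 2, 1, 1, 3]
print(round(weighted_kappa(a, b, categories=[1, 2, 3]), 3))  # about 0.659
```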
Where does this come from? Cohen suggested guidelines for interpreting the kappa statistic; a commonly cited version of those bands appears in the sketch below. Note that these guidelines may not be sufficient for health-related research and testing. Items such as X-rays and test results are often assessed subjectively. While an interrater agreement of 0.4 may be suitable for a general survey, it is usually too weak for something like cancer screening. You will therefore usually want a higher threshold for acceptable interrater reliability when health is at stake. On the other hand, once there are more than 12 codes, the increase in the expected kappa value flattens out, so a simple percentage of agreement could already serve as a measure of the degree of agreement. In addition, the sensitivity performance measures also reach their asymptotes beyond 12 codes. An example of a calculated kappa statistic can be found in Figure 3. Note that the percentage agreement is 0.94, while the kappa is 0.85, a considerable reduction in the level of congruence.
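The sketch below encodes the commonly cited interpretation bands (in the style of Landis and Koch, widely attributed to guidance such as Cohen's); the exact cut points and labels vary from source to source, so treat them as benchmarks rather than hard rules.

```python
# Commonly cited interpretation bands for kappa (Landis & Koch style);
# cut points and labels vary by source and should not be treated as
# hard rules, especially in health research.
def interpret_kappa(kappa):
    if kappa <= 0:
        return "no agreement"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(interpret_kappa(0.40))  # "fair" -- often too weak for clinical screening
print(interpret_kappa(0.85))  # "almost perfect"
```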
The greater the expected chance agreement, the lower the resulting value of kappa. Once kappa has been calculated, the researcher will probably want to assess the significance of the obtained kappa by calculating confidence intervals for it. Percentage agreement is a direct measure, not an estimate, so there is no need for confidence intervals. Kappa, however, is an estimate of interrater reliability, and confidence intervals are therefore of greater interest. The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal Educational and Psychological Measurement in 1960. [5] Why not just use percentage agreement? Because percentage agreement does not correct for agreement that would occur by chance, whereas the kappa statistic does. Cohen's kappa statistic measures interrater reliability (sometimes called interobserver agreement). Interrater reliability occurs when your data raters (or collectors) give the same score to the same data item.
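As a rough sketch of how such a confidence interval might be formed, the code below combines the kappa estimate with the large-sample standard error shown earlier and a normal approximation, using the Figure 3 quantities quoted above; for small samples or extreme prevalence this is only an approximation.

```python
# Sketch of an approximate 95% confidence interval for kappa, using the
# large-sample standard error and a normal approximation.
import math

def kappa_confidence_interval(po, pe, n, z=1.96):
    kappa = (po - pe) / (1 - pe)
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa - z * se, kappa + z * se

low, high = kappa_confidence_interval(po=0.94, pe=0.57, n=222)
print(round(low, 3), round(high, 3))  # roughly 0.788 to 0.933
```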