360-Degree Feedback Reliability | Research Paper Review

Sunday, 3 March 2019

We have delved into one aspect of 360 degree feedback reliability (inter-rater reliability) to provide clarity on 360 Degree Feedback processes.

Please get in touch if there is any area of 360 degree feedback you’d like to discuss.

Our team at CR Systems have taken this whitepaper by Gary J. Greguras of Louisiana State University and Chet Robie of the University of Houston and compiled some of the key pieces of information into a more digestible piece of content that can be used by professionals in any industry.

The study investigated the reliability of supervisor, peer, and subordinate feedback ratings made for managerial development.

Implications for the validity, design, and maintenance of 360-degree feedback systems are discussed along with directions for future research in this area.

Raters provided 360-degree feedback ratings on a sample of 153 managers. Using generalizability theory (a statistical framework for conceptualizing, investigating, and designing reliable observations), results indicated two core findings:

● Multi-source, multi-rater feedback is more reliable and valid than feedback from a single rater
● When the number of raters and items is held constant, supervisors are the most reliable source, with trivial differences between peers and subordinates

Providing Clarity On 360 Degree Feedback

Increasingly, organizations are implementing 360-degree feedback systems as part of their management development programs. Although many organizations have already implemented such systems, research has lagged behind, forcing practitioners to rely on “personal experience and/or trial-and-error approaches” (Church & Bracken, 1997, p. 151). This study was conducted to provide information to practitioners and researchers of 360-degree feedback systems faced with the questions:
● What factors contribute to the lack of reliability in feedback ratings?
● How many raters and items are required from different rater sources to achieve acceptable levels of reliability?
● Which rater source is the most reliable when controlling for the number of raters and items?
● Which rater source is the most reliable under conditions commonly encountered in practice?
The study used generalisability theory to answer these questions by investigating the within-source interrater reliability of supervisor, peer, and subordinate feedback ratings made exclusively for managerial development.

Why Generalisability Theory?

Whilst research surrounding generalisability theory dates back several decades, the studies produced by Brennan (1992) were of such significant value that they have shaped much of the subsequent work on the theory.

Generalisability theory is based on the analysis of variance and serves as a framework for examining the reliability of behavioural measurements. The major advantage of using generalisability theory instead of classical test theory is that generalisability theory allows decision makers to estimate multiple sources of error variances.

Because classical test theory cannot partition error variance into multiple facets, it is unable to answer the question: if there are n raters and n items, what is the estimated level of reliability?

Because it is unable to answer such questions, classical test theory is unable to provide practical recommendations to practitioners as to whether and how much they should increase the number of raters, the number of items, or some combination of both in order to maximize reliability. For a more detailed discussion on generalisability theory and comparisons between generalisability theory and classical test theory, see Brennan (1992) and Shavelson and Webb (1991).
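To make this concrete, here is a minimal Python sketch of the kind of projection generalisability theory supports: the relative G coefficient for a fully crossed persons × raters × items random design. The variance components below are invented for illustration and are not taken from the study.

```python
def g_coefficient(var_p, var_pr, var_pi, var_pri, n_raters, n_items):
    """Relative generalisability (G) coefficient for a fully crossed
    persons x raters x items random design.

    var_p   : person (ratee) variance -- the 'signal'
    var_pr  : person x rater interaction variance
    var_pi  : person x item interaction variance
    var_pri : person x rater x item interaction (plus residual) variance
    """
    # Relative error variance shrinks as raters and items are added.
    rel_error = (var_pr / n_raters
                 + var_pi / n_items
                 + var_pri / (n_raters * n_items))
    return var_p / (var_p + rel_error)

# Hypothetical variance components (NOT the study's estimates):
components = dict(var_p=0.30, var_pr=0.25, var_pi=0.05, var_pri=0.40)

# Project reliability across different numbers of raters and items.
for n_raters in (1, 4, 8):
    for n_items in (1, 5, 20):
        g = g_coefficient(**components, n_raters=n_raters, n_items=n_items)
        print(f"{n_raters} raters, {n_items:>2} items -> G = {g:.2f}")
```

With these made-up components, a single rater and single item yields G = 0.30, while 4 raters and 5 items pushes G above 0.70 — exactly the sort of "how many raters and items do we need?" question that classical test theory cannot address.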

Who Has Already Studied 360-Degree Feedback?

Two studies have been published that investigated within-source reliability of 360 degree feedback using generalisability theory on multisource data (see Kraiger & Teachout, 1990; Webb, Shavelson, Kim, & Chen, 1989). These studies differ from the present study in several ways. First, neither of these studies included subordinate ratings, whereas the current study includes subordinates as a rating source. Second, the reliabilities calculated in these studies, except for peers in the Webb et al. study, are single-rater reliabilities (i.e., based on one rater). In contrast, the current study has multiple raters for each rating source, which allows for the calculation of interrater reliabilities. Third, the samples used in these two generalisability studies were comprised of personnel in entry-level enlisted military occupations, whereas the current study uses a sample of managers.

This distinction is important because most 360-degree feedback programs are designed for enhancing leadership and management capabilities. Fourth, these previous studies collected ratings to appraise performance for research purposes; the current study collected ratings specifically for the purpose of developmental feedback—the primary purpose of 360-degree feedback systems.

In addition to the above-mentioned advantages of using generalisability theory instead of classical test theory, the current study also uses generalisability theory to extend the literature on within-source interrater reliability by simultaneously comparing interrater reliabilities of supervisor, peer, and subordinate ratings. Research investigating within-source interrater reliability generally has found that supervisors tend to have higher interrater reliability than peers or subordinates (Ariani, 2013).

Although considerable research has focused on the interrater reliability of supervisor ratings, less is known about peer and subordinate ratings. Research comparing peer and subordinate interrater reliabilities has produced divergent results. A strength of the current study is that it compares the within-source interrater reliabilities of ratings made by supervisors, peers, and subordinates in the same study; because ratees, items, and rating purpose are constant across rating sources, any confounds attributable to these variables are eliminated.

Data was collected from a wide range of American managers who had participated in leadership development programs in which a common 360-degree feedback instrument was used. Within this selection of participants, there was a mixture of ages, genders and racial backgrounds.

Benchmarks is a multi-rater feedback instrument used to measure managerial strengths and weaknesses. It contains several sections, including one called Managerial Skills and Perspectives, comprising the 16 scales that were included in this study. All items use a 5-point Likert-type scale ranging from 1 (not at all) to 5 (to a very great extent).

Overview of Design
Ratings were analysed separately by source and by scale. Raters and items were treated as random. Rating source was not included as a facet because:
(a) a fundamental assumption of 360-degree feedback systems is that each rating source provides the ratee with unique information (e.g., Hazucha, Hezlett, & Schneider, 1993), and, therefore, high interrater reliability of ratings between sources should not be expected (for a discussion, see Bozeman, 1997);
(b) results from each rating source are presented to the ratee separately; and
(c) the purpose of this study was to investigate within-source variability.
Rating scale was not included as a facet because the scales were developed to be unidimensional, and results are presented to ratees at the scale level. Instead, results were collapsed across rating scales so that the reliability of the average 360-degree feedback scale could be illustrated across different measurement conditions.

What Did The Study Analyse?

The study analysed supervisor, peer, and subordinate ratings separately by scale, using a two-facet (raters × items) design.


What Factors Contribute to the Lack of Reliability in Feedback Ratings?
Using generalisability theory, the first purpose of this study was to estimate multiple sources of error variance in 360-degree feedback ratings. For all three rating sources, little variance was associated with the item effect; instead, rater-related effects, such as rater idiosyncrasies and differing opportunities to observe the ratee, contributed significantly to the variability of 360-degree feedback ratings.

How Many Raters and Items Are Required to Achieve Acceptable Levels of Reliability?
The second purpose of this study was to project estimated reliabilities under multiple measurement conditions. Results indicate that 360-degree feedback ratings begin to reach an acceptable level of reliability for supervisors at 4 raters and 5 items, for peers at 6 raters and 20 items, and for subordinates at 7 raters and 20 items. When a scale has 5 items, 4 supervisors, 8 peers, and 9 subordinates are required to achieve acceptable levels of reliability.
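Thresholds of this kind can be read off a G-coefficient projection by searching for the smallest rater panel that clears a reliability target. The Python sketch below does this with invented variance components (not the study's estimates) and a conventional 0.70 threshold:

```python
def g_coefficient(var_p, var_pr, var_pi, var_pri, n_raters, n_items):
    """Relative G coefficient for a crossed persons x raters x items design."""
    rel_error = (var_pr / n_raters
                 + var_pi / n_items
                 + var_pri / (n_raters * n_items))
    return var_p / (var_p + rel_error)

def min_raters(components, n_items, threshold=0.70, max_raters=30):
    """Smallest number of raters whose projected reliability meets the threshold,
    or None if the threshold is unreachable within max_raters."""
    for n in range(1, max_raters + 1):
        if g_coefficient(**components, n_raters=n, n_items=n_items) >= threshold:
            return n
    return None

# Hypothetical components for a peer-like source (larger rater-related share):
peer_like = dict(var_p=0.25, var_pr=0.30, var_pi=0.05, var_pri=0.40)

print(min_raters(peer_like, n_items=5))   # raters needed on a 5-item scale
print(min_raters(peer_like, n_items=20))  # raters needed on a 20-item scale
```

The exact numbers depend entirely on the variance components estimated from the data, which is why the study's thresholds differ across supervisors, peers, and subordinates.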

Which Rater Source Is the Most Reliable When Controlling for the Number of Raters and Items?
The third purpose of the study was to examine the reliability between supervisors, peers, and subordinates while controlling for the number of raters and items. Results indicate that supervisors are the most reliable, with trivial differences between peers and subordinates. These findings are consistent with past research.

Which Rater Source Is the Most Reliable Under Conditions Commonly Encountered in Practice?
The reliabilities for all three sources at one rater and one item are extremely low and not likely to provide valid or useful information to ratees. Fortunately, in practice, there is generally more than one rater available for peers and subordinates, and at times for supervisors. Ratees were rated on average by one supervisor, four peers, and three subordinates. These results clearly indicate that in practice, given the differing numbers of raters employed across sources, supervisors had the lowest interrater reliability while peers were the most reliable.

Although reliability does not guarantee validity, these results suggest that researchers and practitioners may be well advised to seek, and perhaps weigh more heavily, feedback from peers and subordinates.

What Does This Mean For 360 Degree Feedback Tools?
Multisource feedback systems are designed to provide ratees with information from different sources to help them understand how they are viewed by others and to ultimately improve performance.

Future research should explore characteristics of ratees and raters (e.g., demographic or attitudinal similarities, or rater-ratee work relationships) that may help explain why raters disagree about the performance level or rank-order of ratees.

The study’s second purpose was to project estimated reliabilities under multiple measurement conditions. These results illustrate that the average number of raters generally used to provide feedback to ratees is too low. Because reliability sets an upper limit on validity, the information that is presented back to ratees may not be very useful.

We of course cannot expect managers to change their behaviours in appropriate ways if we provide them with unreliable information. At CR Systems, our 360 Degree software is designed to retrieve the highest-quality, most reliable feedback.

The third purpose of the study was to compare reliability estimates of ratings made by supervisors, peers, and subordinates. When the numbers of raters and items are held constant across the three rating sources, supervisors were the most reliable with trivial differences between peers and subordinates. There are several reasons why supervisors may be the most reliable when the numbers of raters and items are held constant.

● Supervisors may share common values
● Supervisors pay attention to relevant behaviours and provide relevant feedback
● Supervisors likely have more experience at observing relevant information
● Supervisors may have more similar opportunities to observe the ratee’s behaviours

CR Systems create the most advanced and bespoke 360-degree feedback software for all industries. If you are interested in getting started, contact our friendly staff today for a free consultation, with no commitment! Our trained consultants can go through the details and specific needs of your project with you over the phone.
