Standard-Setting Methods: Normative and Absolute

In this post we review the various methodologies for setting standards.

Berk (1986) identified 38 different methods for setting or adjusting performance standards. Several classification schemes have been developed for these methods. The most straightforward divides them into two categories: “normative methods” and “absolute methods” (Mills & Melican, 1988). Normative methods identify the standard based on the performance of a group of examinees. Using these methods, a researcher can fix in advance the number or percentage of test takers who will pass the exam. For example, the standard could be set so that only the top 10% of the students taking the exam pass. Normative methods are useful when a decision requires an exact number of test takers to pass the exam. For example, if a computer company needs 10 employees, the standard is set at the lowest score achieved by the 10 highest-scoring candidates.
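
The normative rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the scores are hypothetical, and the handling of tied scores is a choice made for this sketch, not something the cited authors prescribe.

```python
import math


def cut_score_top_n(scores, n_pass):
    """Return the lowest score among the n_pass highest-scoring examinees."""
    if not 1 <= n_pass <= len(scores):
        raise ValueError("n_pass must be between 1 and the number of examinees")
    ranked = sorted(scores, reverse=True)
    return ranked[n_pass - 1]


def cut_score_top_fraction(scores, fraction):
    """Return the cut score that passes roughly the top `fraction` of examinees."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round down, but always let at least one examinee pass (an assumption).
    n_pass = max(1, math.floor(fraction * len(scores)))
    return cut_score_top_n(scores, n_pass)


# Hypothetical scores for ten examinees.
scores = [55, 62, 62, 70, 71, 75, 80, 84, 88, 93]

top_three_cut = cut_score_top_n(scores, 3)
top_tenth_cut = cut_score_top_fraction(scores, 0.10)

# Anyone at or above the cut passes; tied scores at the cut would all pass.
passed = [s for s in scores if s >= top_three_cut]
print(top_three_cut, top_tenth_cut, passed)
```

Note that the cut score depends entirely on who happens to take the test: the same function applied to a stronger or weaker group returns a different standard.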

Absolute methods establish a standard independent of the distribution of examinee test scores and place no preset limit on the number of persons who can pass the exam. Absolute methods can be further divided into two groups: test-centered methods and examinee-centered methods (Mills & Melican, 1988). Each group contains several techniques for setting a cutscore (see Figure 1). Test-centered methods require judges to review the items on the test and decide at what level a minimally qualified candidate would perform on these items. Many different test-centered methods have been developed to facilitate this decision by the judges. The most common is the Angoff method, named after W. H. Angoff (Norcini & Shea, 1997).

Figure 1. Families of cutscore methodologies.

Examinee-centered methods also use judges, but instead of evaluating items on the test, judges evaluate the examinees. A common method in this grouping is the borderline method. Using this method, judges identify the candidates who are borderline qualified to pass the test. The median test score of these borderline candidates is used as the standard.
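
As a minimal sketch of the borderline method just described, the snippet below takes the median score of the examinees that judges labeled as borderline. The examinee IDs, scores, and category labels are hypothetical and exist only for illustration.

```python
from statistics import median


def borderline_cut_score(scores, judgments):
    """Return the median test score of examinees judged 'borderline'.

    scores:    dict mapping examinee ID -> test score
    judgments: dict mapping examinee ID -> judge-assigned category label
    """
    borderline = [scores[e] for e, label in judgments.items() if label == "borderline"]
    if not borderline:
        raise ValueError("no examinees were judged borderline")
    return median(borderline)


# Hypothetical data: six examinees, three of them judged borderline.
scores = {"A": 48, "B": 61, "C": 57, "D": 72, "E": 64, "F": 59}
judgments = {
    "A": "fail",
    "B": "borderline",
    "C": "borderline",
    "D": "pass",
    "E": "pass",
    "F": "borderline",
}

borderline_cut = borderline_cut_score(scores, judgments)
print(borderline_cut)
```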

Advantages and Disadvantages of Each Type of Method

Since there is not a “correct” standard to which each of these types of methodologies can be compared, it is impossible to declare any as the best type of methodology (Sireci, Robin, & Patelis, 1999). The best that can be done is to understand the advantages and disadvantages of each type of methodology and select the methodology that best fits the testing situation.

Normative Methods

An advantage of normative methods is the ability to set beforehand the number of persons who will pass the exam. This matters because allowing more people to pass than are actually needed may cloud any decisions to be made, or may even be unethical.

Since the normative methodology uses the distribution of scores to establish the standard, a major drawback to this methodology is the lack of guarantee of the level of skills or knowledge of those passing. Even though a person is in the top 10% of those taking the test, the chance still remains that he or she does not have the knowledge necessary. Conversely, individuals with a high level of skill and/or knowledge may not pass if the scores of other test takers were extremely high (Mills & Melican, 1988).

Absolute Methods: Test-Centered Methods

Since standards set using absolute methods are not selected from the distribution of scores, these methods can provide a sense of assurance that an individual who receives a passing score based on these methods likely has a sufficient level of skill or knowledge in the area being tested.

Test-centered methods are the most common of the standard-setting methods. Their popularity is due to the straightforward nature of the methods. Test-centered methods are easy to explain to judges and stakeholders and the analysis is equally simple. These methods require the judges to predict the performance of a minimally qualified candidate on the test items. The two largest drawbacks to these methods are the difficulty in conceptualizing a minimally qualified candidate, and the difficulty judges have in predicting item performance.
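
To make the judges' prediction task concrete, here is a rough sketch of how an Angoff-style cutscore is commonly computed. It assumes each judge rates every item with the probability that a minimally qualified candidate answers it correctly, and that the cutscore is the average across judges of each judge's summed ratings. The ratings and function names are hypothetical, and real studies often add discussion rounds and performance data that this sketch leaves out.

```python
from statistics import mean


def judge_totals(ratings):
    """Sum each judge's item ratings to get that judge's implied cutscore.

    ratings[j][i] is judge j's estimated probability that a minimally
    qualified candidate answers item i correctly.
    """
    n_items = len(ratings[0])
    for row in ratings:
        if len(row) != n_items:
            raise ValueError("every judge must rate every item")
        if any(not 0.0 <= p <= 1.0 for p in row):
            raise ValueError("ratings must be probabilities between 0 and 1")
    return [sum(row) for row in ratings]


def angoff_cut_score(ratings):
    """Average the judges' implied cutscores (in raw-score points)."""
    return mean(judge_totals(ratings))


# Hypothetical ratings: three judges, four items.
ratings = [
    [0.50, 0.75, 0.25, 1.00],
    [0.50, 0.50, 0.50, 0.75],
    [0.75, 0.75, 0.25, 0.75],
]

totals = judge_totals(ratings)
angoff_cut = angoff_cut_score(ratings)
print(totals, round(angoff_cut, 2))
```

The spread in the per-judge totals is worth inspecting: large differences between judges may reflect different conceptualizations of the minimally qualified candidate rather than real disagreement about item difficulty.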

Conceptualizing a Minimally Qualified Candidate

Since judges are asked to predict the performance of minimally qualified candidates, their ability to conceptualize such a candidate effectively and consistently can have a great effect on the validity of a standard. “Without a common understanding of the process and a common definition of minimal competence, differences in item ratings may be more related to background variables of judges than to real differences in perceived item difficulty” (Mills, Melican, & Ahluwalia, 1991, p. 7). Unfortunately, most standard-setting sessions spend little time or effort establishing a concrete description of a minimally qualified candidate. Far too often judges are simply told to conceptualize a minimally qualified candidate, or the definition they are given is too vague to be conceptualized uniformly. This approach allows each judge's past experiences to shape his or her conceptualization of a minimally qualified candidate. The training judges receive on the description or characteristics of a minimally qualified candidate needs to ensure that differences in item ratings are due to differences in item difficulty and not to differences in individual conceptualizations of a minimally qualified candidate (Mills et al., 1991).

Results of a study by Livingston and Zieky (1989) appear to have been affected by the lack of a clear definition of minimal competency. The study used teachers from several different schools to establish standards for basic skills assessment tests in reading and mathematics. Judges were told that someone with a mastery level was someone who had the “ability to perform adequately the reading and mathematical tasks of adult life in modern American society” (p. 124). According to the authors, “these tasks were not specified or enumerated” (p. 124). Depending on the background of the judges, an “adequate level” could have very different meanings. In their discussion section the authors note the possibility that the “teachers at the schools with more able students envision a different type of adult life for their students than do the teachers at schools where students are less able” (Livingston & Zieky, 1989, p. 136). Clearly, standard-setting efforts involving heterogeneous judges ought to spend ample time ensuring that all judges share the same conceptualization and understanding of the minimally qualified candidate.

Predicting Item Performance

Test-centered methods require judges to predict how borderline qualified examinees will perform on each item. Several studies have shown that this is a very difficult, if not impossible, task for judges. Judges appear to be able to rank order items but struggle to produce accurate estimates of item difficulty (Impara & Plake, 1998). To examine this potential drawback, Impara and Plake conducted a study of judges' ability to estimate item performance. The judges selected for the study were sixth-grade science teachers. Impara and Plake reasoned that such judges would be better qualified to predict item performance than the judges typically chosen for standard-setting exercises, since they were very familiar with both the students' performance and the test they would be judging. The researchers defined the borderline group as the students in the D/F range in the science classes the teachers were teaching. Before estimating the proportion of borderline students who would get each item correct, the teachers were asked to assign each student a grade so that the researchers could verify how the borderline group actually performed on the items. The teachers' estimates underestimated the performance of the borderline group on the sixth-grade year-end exam. The authors concluded that judges were unable to effectively predict item performance for the borderline students. The findings of this study may also have suffered from the lack of a clear definition of the borderline group. Since each teacher and class has a different level of rigor, a student who is a D/F student in one class may not be in another. Hence the definition and conceptualization of the borderline group was likely not uniform across teachers. The study also omitted several features typically used in this type of methodology, including group discussion of results and the use of examinee performance data.
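
The distinction between ranking items correctly and estimating their difficulty accurately can be illustrated with a small sketch. It is not the analysis Impara and Plake ran. It simply compares a judge's estimated proportions correct with the observed proportions for a borderline group, reporting the mean signed error (negative means underestimation) and a Spearman-style rank correlation. All numbers and function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def compare_estimates(estimated, actual):
    """Return (mean signed error, rank correlation) for item-level estimates."""
    bias = mean([e - a for e, a in zip(estimated, actual)])
    rank_agreement = pearson(average_ranks(estimated), average_ranks(actual))
    return bias, rank_agreement


# Hypothetical proportions correct for five items in a borderline group.
estimated = [0.30, 0.45, 0.20, 0.60, 0.50]  # judge's predictions
actual = [0.45, 0.60, 0.40, 0.80, 0.65]     # observed performance

bias, rank_agreement = compare_estimates(estimated, actual)
print(round(bias, 2), round(rank_agreement, 2))
```

In this made-up example the judge orders the items perfectly yet underestimates every one of them, which is the pattern the paragraph above describes.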

Absolute Methods: Examinee-Centered Methods

Examinee-centered methods require the judges to evaluate individuals taking the exam. Examinees are placed in categories by the judges based on the judge's perception of the examinee's achievement in the content area covered by the exam. The score distributions of each of these categories are then used to establish the standard (Mills & Melican, 1988).
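
One way the score distributions of judged categories can be turned into a standard is sketched below. It assumes just two categories (qualified and unqualified) and picks the cut that misclassifies the fewest examinees, an approach often called contrasting groups. That name, the two-category setup, the tie-breaking rule, and the scores are all assumptions for this illustration, not details taken from the sources cited here.

```python
def misclassified(cut, qualified, unqualified):
    """Count examinees whose pass/fail result disagrees with the judges."""
    false_fails = sum(1 for s in qualified if s < cut)
    false_passes = sum(1 for s in unqualified if s >= cut)
    return false_fails + false_passes


def contrasting_groups_cut(qualified, unqualified):
    """Return the observed score that minimizes misclassifications.

    Ties are broken in favor of the lowest candidate cut, which is an
    arbitrary choice made for this sketch.
    """
    candidates = sorted(set(qualified) | set(unqualified))
    return min(candidates, key=lambda c: (misclassified(c, qualified, unqualified), c))


# Hypothetical test scores, grouped by the judges' classification.
qualified = [58, 63, 66, 70, 74, 81]
unqualified = [41, 47, 52, 55, 57, 64]

groups_cut = contrasting_groups_cut(qualified, unqualified)
errors = misclassified(groups_cut, qualified, unqualified)
agreement = 1 - errors / (len(qualified) + len(unqualified))
print(groups_cut, errors, round(agreement, 3))
```

The agreement figure only describes how well the cut reproduces the judges' own classifications, so it inherits whatever subjectivity went into those classifications.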

Examinee-centered methods suffer from many of the same disadvantages as test-centered methods, chiefly the difficulty judges have in conceptualizing the categories into which examinees are to be placed. Both examinee-centered and test-centered methods also suffer from issues relating to the selection, training, and consistency of the judges. Examinee-centered methods do have several advantages: for example, pass/fail consistency statistics can be calculated, and judges do not have to predict item performance. Several authors have noted that the use of score distributions to set the standard can appear to be more objective than the test-centered methods. This appearance of increased objectivity is an “illusion of objectivity.… Judgments about examinees furnish the foundation for the statistical estimation of classification…” (Berk, 1986, p. 153).

REFERENCES

Berk, R. A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56(1), 137-172.

Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard-setting method. Journal of Educational Measurement, 35(1), 69-81.

Livingston, S. A., & Zieky, M. J. (1989). A comparative study of standard-setting methods. Applied Measurement in Education, 2(2), 121-141.

Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Features of selected methods. Applied Measurement in Education, 1(3), 261-275.

Mills, C. N., Melican, G. J., & Ahluwalia, N. T. (1991). Defining minimal competence. Educational Measurement: Issues and Practice, 10(1), 7-10.

Norcini, J. J., & Shea, J. A. (1997). The credibility and comparability of standards. Applied Measurement in Education, 10(1), 39-59.

Sireci, S. G., Robin, F., & Patelis, T. (1999). Using cluster analysis to facilitate standard-setting. Applied Measurement in Education, 12(3), 301-325.