EDUR 7130
Educational Research On-Line

Reliability


Reliability

In measurement, reliability refers to the ability to measure something consistently -- to obtain consistent scores every time something is measured.

Suppose, for example, that I have bathroom scales at home. One morning I step on the scales, record my weight, then step off. I immediately repeat this process two more times. My recorded weights are:

148, 286, 205

Are these measurements reliable?

I try weighing myself in the same manner with a second set of scales, and my recorded weights are:

195, 210, 205

Are these measurements reliable? Are they more reliable than the first set of measurements?

I try weighing myself in the same manner with a third set of scales, and my weights are:

204, 204, 203

Are these measurements reliable? As you can see, reliability comes in degrees; some measurements are more reliable than others. In this example, the third scale is more reliable than the second, and the second is more reliable than the first.

Why do we need reliability? Because almost all variables are measured with error, and calculation of reliability provides an index signaling the amount of error in measurement. The more reliable, the less error, and the less reliable, the more error.

Manifest and Latent Variables

Measurement specialists recognize that when something is measured, such as weight, that measurement is subject to error. The error may be small or large, but there will likely be error no matter how precisely it is measured. When measuring something well defined like weight, we can use standards that are widely accepted such as pounds or kilograms, and this helps to reduce error, but still there will be error since the measuring devices will lose calibration over time.

In education and the social sciences, measurement error is especially critical when trying to measure complex constructs like self-efficacy, happiness, stress, anxiety, and achievement, for which there are typically no precisely defined and universally agreed-upon scales as there are for weight. So what is a construct, and how does it have measurement error?

Earlier we learned about independent (IV) and dependent variables (DV). These terms help us communicate the role of variables within a model, e.g., academic achievement (DV) depends in part on self-regulated learning behavior (IV) and motivation (IV). Variables can also be identified as manifest or latent.

Manifest variables, loosely described, are those that can be directly observed or measured. For example, we can directly measure one’s height or weight, or one can directly report one’s age or income.

Latent variables are those that are not so easily observed or measured. Examples include stress, general self-efficacy, workplace autonomy, life satisfaction, and test anxiety. To measure latent variables researchers often use constructs.

Constructs are variables created by taking composite scores from indicators that are designed to measure a latent variable. For example, in measuring test anxiety we recognize there are several dimensions of this variable. For illustration we will consider two primary dimensions: physiological (also called somatic or emotionality), which refers to physical reactions (e.g., sweating, headache, upset stomach, rapid heartbeat, feeling of dread), and cognitive, which refers to thoughts (e.g., expecting failure, negative thoughts, frustration, comparing oneself to others negatively, feelings of inadequacy, self-condemnation).

To measure test anxiety, we may use a number of questionnaire items, which are typically called indicators when measuring a construct. To measure the physiological reaction that occurs during test anxiety, we might include the following indicators:

1. Immediately before or during tests you can feel your heart start to beat faster.
2. You get upset stomachs while taking tests.
3. When taking a test, you get a feeling of dread.

To measure the cognitive component of anxiety, we might use these three indicators:

4. While taking tests you think about how poorly you are doing.
5. You expect failure or poor grades when taking tests.
6. You become frustrated during testing.

The response scale for these six items could range from “Not at all like me” to “Very true of me.” Sample instructions for answering these items appear below.

Please indicate the number that best represents you on the following 7-point scale. Note the anchor descriptions for the scale: “Not at all like me” is 1 and “Very true of me” is 7.

Not at all like me |     1   2   3   4   5   6   7    | Very true of me

Responses to these six anxiety indicators would be summed or averaged to form a composite measure of test anxiety for each respondent. For example, suppose one student provides the following responses:

1. Heart beats faster = 2
2. Upset stomach = 3
3. Feel dread = 2
4. Think of poor performance = 2
5. Expect failure = 1
6. Frustrated = 1

The sum of these six scores is 2+3+2+2+1+1 = 11, and the mean is 11 / 6 = 1.83. On a scale from 1 to 7, this student reports a test anxiety level of 1.83, which suggests low levels of anxiety during tests.

This composite score of 1.83 represents the measurement of the construct test anxiety, a latent variable, for this student.
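The arithmetic above is simple, but a short sketch can make the scoring step concrete. Below is a minimal Python sketch that forms the composite score from the six responses in the example; the variable names are illustrative only.

    # Form a composite test anxiety score from six indicator responses
    # (values taken from the example above; names are illustrative).
    responses = {
        "heart_beats_faster": 2,
        "upset_stomach": 3,
        "feel_dread": 2,
        "think_of_poor_performance": 2,
        "expect_failure": 1,
        "frustrated": 1,
    }

    total = sum(responses.values())      # 2+3+2+2+1+1 = 11
    composite = total / len(responses)   # 11 / 6 = 1.83 (rounded)
    print(f"Sum = {total}, composite test anxiety = {composite:.2f}")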

True Scores

When we attempt to measure something, like test anxiety, we understand that the score we observe, the observed score X, is made of two parts, a true score (T) and error (E):

X = T + E

We would like to know how much error, E, is included when we use observed scores, X, because the more error, the worse our measurement and the less confidence we have that X measures what we hope it measures.

When we examine scores across a sample of respondents, we typically find variation in scores. For example, the student illustrated above had a test anxiety score of 1.83. Certainly other students will have different test anxiety scores, maybe 3.34, 7.00, 1.17, 5.00, 4.50, etc. This means test anxiety scores will show variation, and this variation can be measured using the variance (recall variance and standard deviation from the Descriptive Statistics notes).

Since there is variability in test anxiety scores, we can say that the variance for test anxiety scores will be greater than 0.00. If we use the symbol X for test anxiety scores, we can indicate the variance like this:

VAR(X)

We can also expect variance in both true scores, T, and error in measurement, E, so we can symbolize these variances too:

VAR(T) and VAR(E)

Reliability is defined as the ratio of true score variance to the observed score variance:

Reliability = VAR(T) / VAR(X)

and since X = T + E, we can show that reliability is the ratio of true score variance to true score variance plus error variance:

Reliability = VAR(T) / (VAR(T) + VAR(E))

If there were no error in measurement, then VAR(E) would be zero, VAR(E) = 0.00, and reliability would be equal to 1.00:

Reliability = VAR(T) / (VAR(T) + VAR(E))

= VAR(T) / (VAR(T) + 0.00)

= VAR(T) / VAR(T)

= 1.00

A reliability of 1.00 means no measurement error and therefore we have true scores. How do we estimate reliability?
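To make the ratio concrete, here is a small Python sketch that plugs hypothetical variance values into the definition above; the numbers are invented purely for illustration.

    # Reliability = VAR(T) / (VAR(T) + VAR(E)), with hypothetical variances.
    def reliability(var_true, var_error):
        return var_true / (var_true + var_error)

    print(reliability(9.0, 1.0))   # 0.90 -- little error, high reliability
    print(reliability(9.0, 9.0))   # 0.50 -- error variance equals true score variance
    print(reliability(9.0, 0.0))   # 1.00 -- no error, observed scores equal true scores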

Reliability Coefficients

To assess the degree of reliability, measurement specialists have developed methods to measure the reliability of a given set of scores. Typically the measurement of reliability is reflected in what is called a reliability coefficient. Reliability coefficients range from 1.00 (which is highest) to 0.00 (which is lowest). Reliability coefficients of .6 or .7 and above are considered good for classroom tests, and .9 and above is expected for professionally developed instruments. So the closer the reliability coefficient is to 1.00, the more reliable, or consistent, the scores obtained from an instrument.

Types of Reliability

There are several methods for assessing reliability; the most common are presented below.

(a) Test-Retest

Test-retest reliability is established by correlating scores obtained, on two separate occasions, from the same group of people on the same test. The correlation coefficient obtained is referred to as the coefficient of stability. With test-retest reliability, one attempts to determine whether consistent scores are being obtained from the same group of people over time; hence, one wishes to learn whether scores are stable over time.

For example, one administers a test, say Test A, to students on June 1, then re-administers the same test (Test A) to the same students at a later date, say June 15. Scores from the same person are correlated to determine the degree of association between the two sets. Table 1 shows an example.

Table 1: Example of Test-Retest Scores for Reliability

Test Form A

Person       June 1 Administration    June 15 Administration
Bryan              85                       83
Bob                75                       77
Brenda             63                       60
Bertha             59                       57
Bert               91                       89
Brent              35                       40
Bathsheba          55                       60
Beth               95                       99
Bernie             86                       83
Betty              83                       77

The correlation between the sets of scores in Table 1 is r = .98, which indicates a strong association between the scores. Note that people who scored high on the first administration also scored high on the second, and those who scored low on the first administration scored low on the second. There is a strong relationship between these two sets of scores, so reliability is high.
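If you want to verify this yourself, the coefficient of stability is just a Pearson correlation. Here is a brief Python sketch using the scores from Table 1; the same computation applies to the equivalent-forms example in the next section.

    # Coefficient of stability: correlate June 1 and June 15 scores from Table 1.
    from statistics import correlation  # Pearson r; available in Python 3.10+

    june_1  = [85, 75, 63, 59, 91, 35, 55, 95, 86, 83]
    june_15 = [83, 77, 60, 57, 89, 40, 60, 99, 83, 77]

    r = correlation(june_1, june_15)
    print(f"Test-retest reliability (coefficient of stability): r = {r:.2f}")  # about .98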

The problem with test-retest reliability is that it is only appropriate for instruments for which individuals are not likely to remember their answers from administration to administration. Remembering answers will likely inflate, artificially, the reliability estimate. In general, test-retest reliability is not a very useful method for establishing reliability.

(b) Equivalent-forms

Equivalent-forms reliability is established in a manner similar to test-retest. Scores are obtained from the same group of people, but the scores are taken from different forms of a test. The different forms of the test (or instrument) are designed to measure the same thing, the same construct. The forms should be as similar as possible, but use different questions or wording. It is not enough to simply rearrange the item order; rather, new and different items are required between the two forms. Examples of parallel (equivalent) forms that most of you are familiar with include the SAT, GRE, Miller Analogies Test (MAT), and others. If you take one of these standardized tests more than once, it is unlikely you would take the same form.

To establish equivalent-forms reliability, one administers two forms of an instrument to the same group of people, takes the scores, and correlates them. The higher the correlation coefficient, the higher the equivalent-forms reliability. Table 2 below illustrates this.

Table 2: Example of Equivalent-Forms Reliability

Person       Form A Scores    Form B Scores
Bryan             85               83
Bob               75               77
Brenda            63               60
Bertha            59               57
Bert              91               89
Brent             35               40
Bathsheba         55               60
Beth              95               99
Bernie            86               83
Betty             83               77

The scores are the same as given in Table 1; the only difference is that these scores come from two different forms of the instrument. The correlation between the sets of scores in Table 2 is r = .98, which indicates a strong association between the scores. Note that people who scored high on Form A also scored high on Form B, and those who scored low on Form A scored low on Form B. There is a strong relationship between these two sets of scores, so reliability is high.

Equivalent-forms reliability is not a practical method for establishing reliability. One reason is the difficulty of developing forms of an instrument that are truly parallel (equivalent). A second problem is the impracticality of asking study participants to complete two forms of an instrument. In most cases a researcher wishes to use instruments that are as short and to the point as possible, so asking participants to complete more than one form is often unreasonable.

(c) Internal Consistency

This is the preferred method of establishing reliability for most measuring instruments. Internal consistency reliability represents the consistency with which items on an instrument provide similar scores. For example, suppose I develop three items to measure test anxiety. Test anxiety represents one's fear or over-concern about one's performance in a testing situation. In developing the three items, I should write items that all measure the same construct (test anxiety), yet each provide a slightly different view or angle on it. Below are three items to measure test anxiety that would be administered immediately before one takes a test:

  1. I have an uneasy, upset feeling.
  2. I'm concerned about doing poorly.
  3. I'm thinking of the consequences of failing.

For each item, I would ask the respondent to indicate on a scale how true each statement is at that time. For example:

Table 3: Test Anxiety Items

Instructions: Please indicate, on the scale provided, how true each statement is for you immediately before taking an important test.

                                                   Not True of Me          Very True of Me
1. I have an uneasy, upset feeling.                    1    2    3    4    5    6
2. I'm concerned about doing poorly.                   1    2    3    4    5    6
3. I'm thinking of the consequences of failing.        1    2    3    4    5    6

If these three items show evidence of internal consistency, then a given person should show similar answers for each item. A person who has a high degree of anxiety in testing situations would probably choose response 6 for item 1, response 6 for item 2, and response 6 for item 3. A person with little to no anxiety might choose response 1 for all three items. Note that both of these people show a high degree of consistency in their responses, and this is internal consistency.

Defined, internal consistency is essentially the degree to which similar responses are provided for items designed to measure the same construct (variable), like test anxiety. Table 4 gives another example of internal consistency. In this example, the survey is designed to measure satisfaction with a course. Let's suppose that a student is very dissatisfied with a course--the student hates the course. In response to question 1, the student is likely to select option 5, "always," and for item 2 the student is likely to select option 5, "not at all." I think you can see the pattern here--the internally consistent pattern of selecting the negative responses.

Is there any item on this survey that is likely not to elicit a consistent pattern of response?

Table 4: Internal Consistency Example "The Course Satisfaction Survey"

1. Do you ever feel like skipping this class?
   never (1)   rarely (2)   sometimes (3)   often (4)   always (5)
2. Do you like this class?
   very much (1)   quite (2)   fairly (3)   not too (4)   not at all (5)
3. Do you like the way this class is taught?
   very much (1)   quite (2)   fairly (3)   not too (4)   not at all (5)
4. Are you glad you chose or were assigned to be in this class?
   very glad (1)   most of the time (2)   sometimes (3)   not too often (4)   not at all (5)
5. How much do you feel you have learned in this class?
   a great deal (1)   quite a bit (2)   a fair amount (3)   not much (4)   nothing (5)
6. Do you always do your best in this class?
   always (1)   most of the time (2)   usually (3)   sometimes (4)   never (5)
7. Do you like your other courses?
   very much (1)   quite a bit (2)   a fair amount (3)   not much (4)   not at all (5)
8. Does the teacher give you help when you need it?
   always (1)   most of the time (2)   usually (3)   sometimes (4)   never (5)
9. Do you find the time you spend in this class to be interesting?
   very much (1)   quite a bit (2)   a fair amount (3)   not much (4)   not at all (5)

Adapted from B. W. Tuckman (1988). Conducting Educational Research (3rd ed.). New York: Harcourt, Brace, Jovanovich, p. 236.

Two common ways of calculating internal consistency are the split-half method and Cronbach's alpha (the KR-20 and KR-21 formulas are special cases of alpha for dichotomously scored items). Alpha is the better of the two because it represents the average of all possible split-half reliabilities. All these measures of internal consistency provide an index that ranges from 0.00 to 1.00, with values closer to 1.00 indicating higher levels of internal consistency. In most research reports, writers present Cronbach's alpha like this:

Cronbach's alpha was calculated for each subscale: test anxiety, α = .76; academic self-efficacy, α = .54; and motivation to learn, α = .83.

Often students will overlook the use of Cronbach's alpha in research reports as a form of reporting internal consistency reliability. For example, consider this scenario:

Jones and Smith (1993) conducted research on academic self-efficacy among adolescents and found that self-efficacy scores, with a reported Cronbach's alpha of .87, correlated strongly with persistence, grades, and self-selected tasks related to academics. Further research reported by Jones and Smith also denotes...

In the above report note that self-efficacy scores have an internal consistency of .87, as indicated by Cronbach's alpha.
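For those curious how alpha is actually computed, here is a brief Python sketch of the standard formula, alpha = (k / (k - 1)) x (1 - sum of item variances / variance of total scores). The item responses below are hypothetical and serve only to illustrate the calculation.

    # Cronbach's alpha for a set of items (one list of scores per item,
    # respondents in the same order across items).
    from statistics import variance

    def cronbach_alpha(items):
        k = len(items)                                    # number of items
        item_vars = sum(variance(scores) for scores in items)
        totals = [sum(resp) for resp in zip(*items)]      # total score per respondent
        return k / (k - 1) * (1 - item_vars / variance(totals))

    # Hypothetical responses of five students to the three test anxiety items (1-6 scale)
    item1 = [6, 5, 2, 1, 4]
    item2 = [6, 4, 2, 2, 5]
    item3 = [5, 5, 1, 2, 4]
    print(f"alpha = {cronbach_alpha([item1, item2, item3]):.2f}")  # high -- responses agree across items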

 

(d) Scorer/Rater

Scorer/rater reliability is used to determine the consistency with which one rater assigns scores (intra-judge), or the consistency with which two or more raters assign scores (inter-judge). Intra-judge reliability refers to a single judge assigning scores. Remember that consistency requires multiple scores (at least two) in order to establish reliability, so for intra-judge reliability to be established, a single judge or rater must score something more than once. If asked to judge an art exhibit, to establish reliability a judge must rate the same exhibit more than once to learn whether the judge is reliable in assigning scores. If the judge rates it high once and low the second time, rater reliability is obviously lacking.

For inter-judge reliability, one is concerned with showing that multiple raters are consistent in their scoring of something. As an example, consider the multiple judges used at the Olympics. For the high dive competition, about seven judges are often used. If the seven judges provide scores like 5.4, 5.2, 5.1, 5.5, 5.3, 5.4, and 5.6, then there is some consistency. If, however, the scores are something like 5.4, 4.3, 5.1, 4.9, 5.3, 5.4, and 5.6, then it is clear the judges are not using the same criteria for determining scores, so they lack consistency, and hence reliability.
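One rough way to see the difference between these two sets of judge scores is to compare their spread. Formal inter-rater indices (such as the intraclass correlation) exist, but the simple Python sketch below, using the scores from the example above, makes the point.

    # Compare the spread of the two sets of judge scores from the example above.
    from statistics import stdev

    consistent_judges   = [5.4, 5.2, 5.1, 5.5, 5.3, 5.4, 5.6]
    inconsistent_judges = [5.4, 4.3, 5.1, 4.9, 5.3, 5.4, 5.6]

    print(f"Spread (SD) for consistent judges:   {stdev(consistent_judges):.2f}")   # small spread
    print(f"Spread (SD) for inconsistent judges: {stdev(inconsistent_judges):.2f}") # larger spread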