EDUR 7130
Educational Research On-Line
Reliability
In measurement, reliability refers to the ability to measure something consistently -- to obtain consistent scores every time something is measured.
Suppose, for example, that I have bathroom scales at home. One morning I step on the scales, record my weight, then step off. I immediately repeat this process two more times. My recorded weights are:
148, 286, 205
Are these measurements reliable?
I try weighing myself in the same manner with a second set of scales, and my recorded weights are:
195, 210, 205
Are these measurements reliable? Are they more reliable than the first set of measurements?
I try weighing myself in the same manner with a third set of scales, and my weights are:
204, 204, 203
Are these measurements reliable? As you can see, reliability comes in degrees; some measurements are more reliable than others. In this example, the third scale is more reliable than the second, and the second is more reliable than the first.
Why do we need reliability? Because almost all variables are measured with error, and calculation of reliability provides an index signaling the amount of error in measurement. The more reliable, the less error, and the less reliable, the more error.
Manifest and Latent Variables
Measurement specialists recognize that when something is measured, such as
weight, that measurement is subject to error. The error may be small or large,
but there will likely be error no matter how precisely it is measured. When
measuring something well defined like weight, we can use standards that are
widely accepted such as pounds or kilograms, and this helps to reduce error, but
still there will be error since the measuring devices will lose calibration over
time.
In education and the social sciences measurement error is
especially critical when trying to measure complex constructs like
self-efficacy, happiness, stress, anxiety, and achievement for which there are
typically no precisely defined and universally agreed upon scales like with
weight. So what is a construct and how does it have measurement error?
Earlier we learned about independent (IV) and dependent variables (DV). These
terms help us communicate the role of variables within a model, e.g., academic
achievement (DV) depends in part on self-regulated learning behavior (IV) and
motivation (IV). Variables can also be identified as manifest or latent.
Manifest variables, loosely described, are those that can be directly
observed or measured. For example, we can directly measure one’s height or
weight, or one can directly report one’s age or income.
Latent variables
are those that are not so easily observed or measured. Examples include stress,
general self-efficacy, workplace autonomy, life satisfaction, and test anxiety.
To measure latent variables researchers often use constructs.
Constructs
are variables created by taking composite scores from indicators that are
designed to measure a latent variable. For example, in measuring test anxiety we
recognize there are several dimensions of this variable. For illustration we
will consider two primary dimensions: physiological (also called somatic or
emotionality), which are physical reactions (e.g. sweating, headache, upset
stomach, rapid heartbeat, feeling of dread), and cognitive, which refers to
thoughts (e.g., expecting failure, negative thoughts, frustration, comparing
oneself to others negatively, feelings of inadequacy, self-condemnation).
To measure test anxiety, we may use a number of questionnaire items, which
are typically called indicators when measuring a construct. To measure the
physiological reaction that occurs during test anxiety, we might include the
following indicators:
1. Immediately before or during tests you can feel your heart start to beat faster.
2. You get an upset stomach while taking tests.
3. When taking a test, you get a feeling of dread.
To measure the cognitive component of anxiety, we might use these three indicators:
4. While taking tests you think about how poorly you are doing.
5. You expect failure or poor grades when taking tests.
6. You become frustrated during testing.
The response scale for these six items could be a range from
“Not at all like me” to “Very true of me.” Sample instructions for answering
these items appear below.
Please indicate the number that best represents you on the following 7-point scale. Note the anchor descriptions for the scale: “Not at all like me” is 1 and “Very true of me” is 7.
Not at all like me | 1 2 3 4 5 6 7 | Very true of me
Responses to these six anxiety indicators would be summed or averaged to form
a composite measure of test anxiety for each respondent. For example, suppose
one student provides the following responses:
1. Heart beats faster = 2
2. Upset stomach = 3
3. Feel dread = 2
4. Think of poor performance = 2
5. Expect failure = 1
6. Frustrated = 1
The sum of these six scores is
2+3+2+2+1+1 = 11, and the mean is 11 / 6 = 1.83. On a scale from 1 to 7, this
student reports a test anxiety level of 1.83, which suggests low levels of
anxiety during tests.
This composite score of 1.83 represents the
measurement of the construct test anxiety, a latent variable, for this student.
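To make the arithmetic concrete, here is a minimal Python sketch that computes the composite score for the six hypothetical item responses above; the variable names are illustrative only.

```python
# Composite score for the six test-anxiety indicators (hypothetical responses above)
responses = [2, 3, 2, 2, 1, 1]   # items 1-6, each on the 1-7 scale

total = sum(responses)             # 2+3+2+2+1+1 = 11
mean = total / len(responses)      # 11 / 6 = 1.83

print(f"Sum = {total}, Mean = {mean:.2f}")   # Sum = 11, Mean = 1.83
```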
True Scores
When we attempt to measure something, like test anxiety, we understand that
the score we observe, the observed score X, is made of two parts, a true score
(T) and error (E):
X = T + E
We would like to know how much error,
E, is included when we use observed scores, X, because the more error, the worse
our measurement and the less confidence we have that X measures what we hope it
measures.
When we examine scores across a sample of respondents, we typically find variation in scores. For example, the student illustrated above had a test anxiety score of 1.83. Certainly other students will have different test anxiety scores, maybe 3.34, 7.00, 1.17, 5.00, 4.50, etc. This means test anxiety scores will show variation, and this variation can be measured using the variance (recall variance and standard deviation from the Descriptive Statistics notes).
Since there is variability in test anxiety scores, we can say that the variance for test anxiety scores will be greater than 0.00. If we use the symbol X for test anxiety scores, we can indicate the variance like this:
VAR(X)
We can also expect variance in both true scores, T, and error in measurement, E, so we can symbolize these variances too:
VAR(T) and VAR(E)
Reliability is defined as the ratio of true score variance to the observed score variance:
Reliability = VAR(T) / VAR(X)
and since X = T + E, we can show that reliability is the ratio of true score variance to true score variance plus error variance:
Reliability = VAR(T) / (VAR(T) + VAR(E))
If there were no error in measurement, then VAR(E) would be zero, VAR(E) = 0.00, and reliability would be equal to 1.00:
Reliability = VAR(T) / (VAR(T) + VAR(E))
            = VAR(T) / (VAR(T) + 0.00)
            = VAR(T) / VAR(T)
            = 1.00
A reliability of 1.00 means no measurement error and therefore we have true scores. How do we estimate reliability?
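This ratio can be illustrated with a small simulation. The Python sketch below is hypothetical: it generates true scores and random error with chosen variances, forms observed scores X = T + E, and shows that VAR(T) / VAR(X) is essentially the same as VAR(T) / (VAR(T) + VAR(E)).

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
true_scores = rng.normal(loc=50, scale=8, size=n)   # VAR(T) is about 64 (values chosen for illustration)
error = rng.normal(loc=0, scale=4, size=n)          # VAR(E) is about 16
observed = true_scores + error                      # X = T + E

var_t = true_scores.var(ddof=1)
var_e = error.var(ddof=1)
var_x = observed.var(ddof=1)

print(var_t / var_x)             # reliability as VAR(T) / VAR(X)
print(var_t / (var_t + var_e))   # same ratio written as VAR(T) / (VAR(T) + VAR(E))
# Both print approximately 64 / (64 + 16) = 0.80; they differ slightly because of sampling error.
```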
Reliability Coefficients
To assess the degree of reliability, measurement specialists have developed methods to measure the reliability of a given set of scores. Typically the measurement of reliability is reflected in what is called a reliability coefficient. Reliability coefficients range from 0.00 (lowest) to 1.00 (highest). Reliability coefficients of .6 or .7 and above are considered good for classroom tests, and .9 and above is expected for professionally developed instruments. So the closer the reliability coefficient is to 1.00, the more reliable, or consistent, the scores obtained from an instrument.
Types of Reliability
There are several methods for assessing reliability; the most common are presented below.
(a) Test-Retest
Test-retest reliability is established by correlating scores obtained, on two separate occasions, from the same group of people on the same test. The correlation coefficient obtained is referred to as the coefficient of stability. With test-retest reliability, one attempts to determine whether consistent scores are being obtained from the same group of people over time; hence, one wishes to learn whether scores are stable over time.
For example, one administers a test, say Test A, to students on June 1, then re-administers the same test (Test A) to the same students at a later date, say June 15. Scores from the same person are correlated to determine the degree of association between the two sets. Table 1 shows an example.
Table 1: Example of Test-Retest Scores for Reliability
Test Form A (administered on both occasions)
Person | June 1 Administration Scores | June 15 Administration Scores |
Bryan | 85 | 83 |
Bob | 75 | 77 |
Brenda | 63 | 60 |
Bertha | 59 | 57 |
Bert | 91 | 89 |
Brent | 35 | 40 |
Bathsheba | 55 | 60 |
Beth | 95 | 99 |
Bernie | 86 | 83 |
Betty | 83 | 77 |
The correlation between the sets of scores in Table 1 is r = .98, which indicates a strong association between the scores. Note that people who scored high on the first administration also scored high on the second, and those who scored low on first administration scored low on the second. There is a strong relationship between these two sets of scores, so high reliability.
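For readers who want to verify the r = .98 figure, here is a minimal Python sketch that correlates the two sets of Table 1 scores (listed in table order, Bryan through Betty).

```python
import numpy as np

# Table 1 scores, in table order (Bryan through Betty)
june_1  = [85, 75, 63, 59, 91, 35, 55, 95, 86, 83]
june_15 = [83, 77, 60, 57, 89, 40, 60, 99, 83, 77]

# Pearson correlation between the two administrations (the coefficient of stability)
r = np.corrcoef(june_1, june_15)[0, 1]
print(round(r, 2))   # 0.98
```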
The problem with test-retest reliability is that it is only appropriate for instruments for which individuals are not likely to remember their answers from administration to administration. Remembering answers will likely inflate, artificially, the reliability estimate. In general, test-retest reliability is not a very useful method for establishing reliability.
(b) Equivalent-forms
Equivalent-forms reliability is established in a manner similar to test-retest. Scores are obtained from the same group of people, but the scores are taken from different forms of a test. The different forms of the test (or instrument) are designed to measure the same thing, the same construct. The forms should be as similar as possible, but use different questions or wording. It is not enough to simply rearrange the item order; rather, new and different items are required between the two forms. Examples of parallel (equivalent) forms that most of you are familiar with include the SAT, GRE, Miller Analogies Test (MAT), and others. Should you take one of these standardized tests more than once, it is unlikely you would take the same form.
To establish equivalent-forms reliability, one administers two forms of an instrument to the same group of people, takes the scores, and correlates them. The higher the correlation coefficient, the higher the equivalent-forms reliability. Table 2 below illustrates this.
Table 2: Example of Equivalent-Forms Reliability
Person | Instrument Form A Scores | Instrument Form B Scores |
Bryan | 85 | 83 |
Bob | 75 | 77 |
Brenda | 63 | 60 |
Bertha | 59 | 57 |
Bert | 91 | 89 |
Brent | 35 | 40 |
Bathsheba | 55 | 60 |
Beth | 95 | 99 |
Bernie | 86 | 83 |
Betty | 83 | 77 |
The scores are the same as those given in Table 1; the only difference is that these scores come from two different forms of the instrument. The correlation between the sets of scores in Table 2 is r = .98, which indicates a strong association between the scores. Note that people who scored high on Form A also scored high on Form B, and those who scored low on Form A scored low on Form B. There is a strong relationship between these two sets of scores, so high reliability.
Equivalent-forms reliability is not a practical method for establishing reliability. One reason is the difficulty of developing forms of an instrument that are truly parallel (equivalent). A second problem is the impracticality of asking study participants to complete two forms of an instrument. In most cases a researcher wishes to use instruments that are as short and to the point as possible, so asking participants to complete more than one form is often unreasonable.
(c) Internal Consistency
This is the preferred method of establishing reliability for most measuring instruments. Internal consistency reliability represents the consistency with which items on an instrument provide similar scores. For example, suppose I develop three items to measure test anxiety. Test anxiety represents one's fear or over-concern for one's performance in a testing situation. In developing the three items, I should concentrate on three items that measure the same construct (test-anxiety), yet provide a slightly different view or angle on it. Below are three items to measure test anxiety that would be administered immediately before one takes a test:
For each item, I would ask the respondent to indicate on a scale how true each statement is at that time. For example:
Table 3: Test Anxiety Items
Instructions: Please indicate, on the scale provided, how true each statement is for you immediately before taking an important test.
 | 1 = Not True of Me | 2 | 3 | 4 | 5 | 6 = Very True of Me |
1. I have an uneasy, upset feeling. | 1 | 2 | 3 | 4 | 5 | 6 |
2. I'm concerned about doing poorly. | 1 | 2 | 3 | 4 | 5 | 6 |
3. I'm thinking of the consequences of failing. | 1 | 2 | 3 | 4 | 5 | 6 |
If these three items show evidence of internal consistency, then a given person should show similar answers for each item. A person who has a high degree of anxiety in testing situations would probably choose response 6 for item 1, response 6 for item 2, and response 6 for item 3. A person with little to no anxiety might choose response 1 for all three items. Note that both of these people show a high degree of consistency in their responses, and this is internal consistency.
Defined, internal consistency is essentially the degree to which similar responses are provided for items designed to measure the same construct (variable), like test anxiety. Table 4 gives another example of internal consistency. In this example, the survey is designed to measure satisfaction with a course. Let's suppose that a student is very dissatisfied with a course--the student hates the course. In response to question 1, the student is likely to select option 5 "always," and for item 2 the student is likely to select option 5 "not at all." I think you can see the pattern here--the internally consistent pattern of selecting the negative responses.
Is there any item on this survey that is likely not to elicit a consistent pattern or response?
Table 4: Internal Consistency Example "The Course Satisfaction Survey"
1. Do you ever feel like skipping this class? | never (1) | rarely (2) | sometimes (3) | often (4) | always (5) |
2. Do you like this class? | very much (1) | quite (2) | fairly (3) | not too (4) | not at all (5) |
3. Do you like the way this class is taught? | very much (1) | quite (2) | fairly (3) | not too (4) | not at all (5) |
4. Are you glad you chose or were assigned to be in this class? | very glad (1) | most of the time (2) | sometimes (3) | not too often (4) | not at all (5) |
5. How much do you feel you have learned in this class? | a great deal (1) | quite a bit (2) | a fair amount (3) | not much (4) | nothing (5) |
6. Do you always do your best in this class? | always (1) | most of the time (2) | usually (3) | sometimes (4) | never (5) |
7. Do you like your other courses? | very much (1) | quite a bit (2) | a fair amount (3) | not much (4) | not at all (5) |
8. Does the teacher give you help when you need it? | always (1) | most of the time (2) | usually (3) | sometimes (4) | never (5) |
9. Do you find the time you spend in this class to be interesting? | very much (1) | quite a bit (2) | a fair amount (3) | not much (4) | not at all (5) |
Adapted from B. W. Tuckman (1988). Conducting Educational Research (3rd ed.). New York: Harcourt, Brace, Jovanovich, p. 236.
Two ways of calculating internal consistency are the split-half method and Cronbach's alpha (of which the KR-20 and KR-21 formulas are special cases for dichotomously scored items). Alpha is the better of the two because it provides the average of all possible split-half reliabilities. All of these measures of internal consistency provide an index that ranges from 0.00 to 1.00, with values closer to 1.00 indicating higher levels of internal consistency. In most research, writers will report Cronbach's alpha like this:
Cronbach's alpha was calculated for each subscale: test anxiety, α = .76; academic self-efficacy, α = .54; and motivation to learn, α = .83.
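As a concrete illustration, here is a short Python sketch that computes Cronbach's alpha from a respondents-by-items matrix of scores. The six-person response matrix is hypothetical; the formula is the standard one, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores).

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]                              # number of items
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of composite (total) scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 5 respondents x 3 test-anxiety items (1-7 scale)
anxiety_items = [
    [2, 3, 2],
    [6, 5, 6],
    [1, 1, 2],
    [5, 6, 5],
    [3, 3, 4],
]
print(round(cronbach_alpha(anxiety_items), 2))   # about .97 for these made-up, highly consistent responses
```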
Often students will overlook the use of Cronbach's alpha in research reports as a form of reporting internal consistency reliability. For example, consider this scenario:
Jones and Smith (1993) conducted research on academic self-efficacy among adolescents and found that self-efficacy scores, with a reported Cronbach's alpha of .87, correlated strongly with persistence, grades, and self-selected tasks related to academics. Further research reported by Jones and Smith also denotes...
In the above report note that self-efficacy scores have an internal consistency of .87, as indicated by Cronbach's alpha.
(d) Scorer/Rater
Scorer/rater reliability is used to determine the consistency with which one rater assigns scores (intra-judge), or the consistency with which two or more raters assign scores (inter-judge). Intra-judge reliability refers to a single judge assigning scores. Remember that consistency requires multiple scores (at least two) in order to establish reliability, so for intra-judge reliability to be established, a single judge or rater must score something more than once. If asked to judge an art exhibit, to establish reliability a judge must rate the same exhibit more than once to learn if the judge is reliable in assigning scores. If the judge rates it high once and low the second time, obviously rater reliability is lacking.
For inter-judge reliability, one is concerned with showing that multiple raters are consistent in their scoring of something. As an example, consider the multiple judges used at the Olympics. For the high-dive competition, about seven judges are often used. If the seven judges provide scores like 5.4, 5.2, 5.1, 5.5, 5.3, 5.4, and 5.6, then there is some consistency there. If, however, the scores are something like 5.4, 4.3, 5.1, 4.9, 5.3, 5.4, and 5.6, then it is clear the judges are not using the same criteria for determining scores, so they lack consistency and, therefore, reliability.
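One simple way to quantify inter-judge consistency is to correlate the scores two raters assign to the same set of performances (more specialized indices, such as the intraclass correlation, also exist but are beyond these notes). The judge scores in the Python sketch below are hypothetical.

```python
import numpy as np

# Hypothetical scores that two judges assign to the same eight dives
judge_1 = [5.4, 5.2, 5.1, 5.5, 5.3, 5.4, 5.6, 4.8]
judge_2 = [5.5, 5.1, 5.2, 5.6, 5.2, 5.3, 5.7, 4.9]

# Pearson correlation as a simple inter-judge reliability index
r = np.corrcoef(judge_1, judge_2)[0, 1]
print(round(r, 2))   # a high value indicates the judges rank the dives consistently
```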