In: Meaningful Language Test Scores: Research to enhance score interpretation
Edited by Spiros Papageorgiou and Venessa F. Manna
[Innovations in Language Learning and Assessment 1] 2023
pp. 14–34
Chapter 2. Considerations in developing vertical scales for language tests
Published online: 29 June 2023
https://doi.org/10.1075/illa.1.02mon
Abstract
This chapter provides a framework for building vertical
scales, for language assessments in general and for the
TOEFL® Family of Assessments in particular. Topics
covered include aspects of vertical scale design (growth definitions,
vertical articulation, data collection), statistical methods for
vertical linking, and evaluation and maintenance of the resulting
vertical scale. Also discussed are challenges associated with vertical
scaling noted in the research literature, both in general and as they
pertain to language proficiency assessments.
Article outline
- Introduction
- Vertical scale design
  - Growth definitions
  - Vertical articulation
  - Data collection design
- Statistical methods for vertical linking
  - Hieronymus scaling
  - Thurstone scaling
  - IRT scaling
    - IRT scaling Decision 1: Choice of model
    - IRT scaling Decision 2: Separate vs concurrent calibration
    - IRT scaling Decision 3: Scores
- Evaluation of a vertical scale
- Maintenance of the vertical scale
- Challenges with vertical scaling
- Conclusion
