Journal of Pedagogical Sociology and Psychology
Assessing the psychometric properties of AI-generated multiple-choice exams in a psychology subject
Jomar Saif P. Baudin 1 *
1 Faculty of Psychology Program, Social Sciences Department, College of Arts and Sciences, Southern Luzon State University, Lucban, Quezon, Philippines
* Corresponding Author
ARTICLE INFO

Journal of Pedagogical Sociology and Psychology, 2025 - Volume 7 Issue 3, pp. 18-34
https://doi.org/10.33902/jpsp.202536891

Article Type: Research Article

Published Online: 13 Sep 2025

ABSTRACT
This study assessed the psychometric properties of AI-generated multiple-choice questions in undergraduate psychology education, specifically focusing on an Experimental Psychology course. Using a mixed-methods approach, we evaluated 80 multiple-choice questions created by ChatGPT-4 through expert content validation, administration to undergraduate psychology students, and comprehensive psychometric analysis. Results indicated that AI-generated items demonstrated reasonable content validity, with the strongest ratings for clarity and the weakest for distractor quality. The assessment showed acceptable internal consistency and moderate test-retest reliability. Item difficulty analysis revealed a slight tendency toward easier items, with 58.75% of questions falling in the medium difficulty range. Discrimination indices were generally good, although 11.25% of items showed poor discrimination. Cognitive level analysis identified a significant imbalance in the distribution of items across Bloom's taxonomy, with 73.75% targeting lower-order thinking skills (remembering, understanding) and only 8.75% addressing higher-order skills (analyzing, evaluating). These findings suggest that while AI can generate psychometrically sound multiple-choice questions for psychology assessment, significant limitations remain in assessing higher-order cognitive skills. The results support a hybrid approach that leverages AI to generate items assessing fundamental knowledge while preserving human expertise for evaluating higher cognitive domains. This study contributes to the emerging understanding of AI applications in psychology education by providing domain-specific insights into the capabilities and limitations of AI-generated assessment tools.
LICENSE
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.