Experiments on automatic question generation from free texts based on Chinese text-mining techniques

Andrew K Lui, Chan Kin Yan, Leung Ho Hin, Lai Ho Yin and Ki Yan Lok
The Open University of Hong Kong
Hong Kong SAR, China


The key objective of intelligent tutoring systems is to provide individualized instruction based on the learning needs of students. Offering timely and context-sensitive review questions increases students' retention of what they have just studied and improves their level of engagement with the content. Supporting such a function requires a large question bank from which suitable questions can be selected based on the student and instructor models. Creating such a bank by hand is costly, and automatic question generation can reduce the cost and effort involved. A useful variant of automatic question generation takes free texts as the source and employs text-mining techniques to extract related information for producing questions. As students study textual course content with an intelligent tutoring system, automatically generated questions based on the relevant content should provide more opportunities for review and more appropriate self-assessment.

The aim of this paper is to discuss techniques for automatic question generation based on text mining of Chinese text sources. This differs from previous work in the area in that: (1) the techniques are applicable to any course with its content in electronic free-text form, unlike techniques that rely on dedicated concept maps, data tables, or rule-based engines for a particular course; and (2) the techniques work with Chinese texts, and can therefore be employed in the increasing number of Chinese-based online courses in our University. The challenge in this work lies in making sense of free-structured Chinese texts, in which keywords or even words are not readily identifiable because of the absence of word separators.
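To illustrate why the absence of word separators matters, the following is a minimal sketch of forward maximum matching, a simple dictionary-based baseline. It is not one of the methods described in this paper; the function name, the toy dictionary and the maximum word length are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation (illustrative baseline only).

    At each position the longest dictionary entry is taken; if none
    matches, a single character is emitted as its own word.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real system would use a large lexicon.
TOY_DICTIONARY = {"香港", "公開", "大學", "公開大學"}

print(forward_max_match("香港公開大學", TOY_DICTIONARY))
```

The same character string could also be split as 香港 / 公開 / 大學, which is why a segmentation step has to choose among competing candidates rather than simply look words up.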

The procedure for automatic question generation begins with the segmentation of Chinese sentences into words and the extraction of named entities such as 'person', 'organization', 'event', 'date' and 'location'. The paper outlines two methods: a variational approach that finds the optimal segmentation based on dictionaries, vocabulary and feature patterns, and a quicker variant based on genetic algorithms. After the named entities and their relations have been identified, the next step is to randomly select named entities and apply them to pre-defined question templates. The paper describes the techniques, which have been applied to a Chinese history course and a current affairs course, together with the details of implementation and an evaluation of these two applications. It focuses mainly on technological issues and pedagogical concerns to be addressed in future work.
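The final step, selecting a named entity at random and applying it to a pre-defined template, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the `Entity` class, the fill-in-the-blank template wording, the example sentence and the hand-labelled entities are all hypothetical, and the entity extraction step is assumed to have already been done.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    kind: str  # e.g. 'person', 'organization', 'event', 'date', 'location'


# Hypothetical templates, one per entity type (assumed wording).
TEMPLATES = {
    "person": "Which person fits the blank? {cloze}",
    "date": "Which date fits the blank? {cloze}",
    "location": "Which location fits the blank? {cloze}",
}


def generate_question(sentence, entities, rng):
    """Pick one entity at random and blank it out of the sentence.

    Returns a (question, answer) pair, or None if no entity is usable.
    """
    candidates = [e for e in entities
                  if e.kind in TEMPLATES and e.text in sentence]
    if not candidates:
        return None
    target = rng.choice(candidates)
    cloze = sentence.replace(target.text, "____", 1)
    return TEMPLATES[target.kind].format(cloze=cloze), target.text


# Hand-labelled example; a real system would obtain these automatically.
SENTENCE = "秦始皇於公元前221年統一中國。"
ENTITIES = [
    Entity("秦始皇", "person"),
    Entity("公元前221年", "date"),
    Entity("中國", "location"),
]

result = generate_question(SENTENCE, ENTITIES, random.Random())
if result is not None:
    question, answer = result
    print(question)
    print("Answer:", answer)
```

Passing the random number generator in as a parameter keeps the selection reproducible when a seeded `random.Random` is supplied, which is convenient when evaluating generated questions.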