Computer-based testing
Computer-based testing has witnessed rapid growth in the past decade and computers are now used to deliver language tests in many settings. A computer-based version of the TOEFL was introduced on a regional basis in the summer of 1998, tests are now available on CD-ROM, and the Internet is increasingly used to deliver tests to users. Alderson (1996) points out that computers have much to offer language testing: not just for test delivery, but also for test construction, test compilation, response capture, test scoring, result calculation and delivery, and test analysis. They can also, of course, be used for storing tests and details of candidates. In short, computers can be used at all stages of the test development and administration process. Most work reported in the literature, however, concerns the compilation, delivery and scoring of tests by computer. Fulcher (1999b) describes the delivery of an English language placement test over the Web, and Gervais (1997) reports the mixed results of transferring a diagnostic paper-and-pencil test to the computer. Such articles set the scene for studies of computer-based testing which compare the accuracy of the computer-based test with a traditional paper-and-pencil test, addressing the advantages of a computer-delivered test in terms of accessibility and speed of results, and possible disadvantages in terms of bias against those with no computer familiarity, or with negative attitudes to computers.

This concern with bias is a recurrent theme in the literature, and it inspired a large-scale study by the Educational Testing Service (ETS), the developers of the computer-based version of the TOEFL, who needed to show that such a test would not be biased against those with no computer literacy. Jamieson et al. (1998) describe the development of a computer-based tutorial intended to train examinees to take the computerised TOEFL. Taylor et al.
(1999) examine the relationship between computer familiarity and TOEFL scores, showing that those with high computer familiarity tend to score higher on the traditional TOEFL. They compare examinees with high and low computer familiarity in terms of their performance on the computer tutorial and on computerised TOEFL-like tasks. They claim that no relationship was found between computer familiarity and performance on the computerised tasks after controlling for English language proficiency. They conclude that there is no evidence of bias against candidates with low computer familiarity, but also
take comfort in the fact that all candidates will be able to take the computer tutorial before taking an operational computer-based TOEFL.

The commonest use of computers in language testing is to deliver tests adaptively (e.g., Young et al., 1996). This means that the computer adjusts the items to be delivered to a candidate in the light of that candidate's success or failure on previous items. If the candidate fails a difficult item, s/he is presented with an easier item, and if s/he gets an item correct, s/he is presented with a more difficult item. This has advantages: firstly, candidates are presented with items at their level of ability, and are not faced with items that are either too easy or too difficult; secondly, computer-adaptive tests (CATs) are typically quicker to deliver, and security is less of a problem since different candidates are presented with different items. Many authors discuss the advantages of CATs (Laurier, 1998; Brown, 1997; Chalhoub-Deville & Deville, 1999; Dunkel, 1999), but they also emphasise issues that test developers and score users must address when developing or using CATs. When designing such tests, developers have to take a number of decisions: what should the entry level be, and how is this best determined for any given population? At what point should testing cease (the so-called exit point) and what criteria should determine this? How can content balance best be assured in tests where the main principle for adaptation is psychometric? What are the consequences of not allowing users to skip items, and can these consequences be ameliorated? How can developers ensure that some items are not presented much more frequently than others (item exposure), because of their facility or their content?
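The up/down adaptation rule described above can be sketched in a few lines of code. This is a deliberately simplified illustration: the fixed-step ability update and the item bank shown here are assumptions for demonstration only, whereas operational CATs use IRT-based ability estimation, content balancing and item-exposure control.

```python
# Simplified sketch of computer-adaptive item selection (illustrative only;
# real CATs estimate ability with item response theory rather than this
# naive fixed-step rule, and add content and exposure constraints).

def run_cat(item_bank, answer, entry_level=0.0, max_items=10, step=0.5):
    """item_bank: list of (item_id, difficulty) pairs on an arbitrary scale.
    answer(item_id) -> bool reports whether the candidate got the item right.
    Returns (final ability estimate, list of item ids administered)."""
    ability = entry_level          # provisional ability estimate (entry level)
    administered = []
    for _ in range(max_items):
        # choose the unused item whose difficulty is closest to the estimate
        unused = [item for item in item_bank if item[0] not in administered]
        if not unused:
            break                  # item bank exhausted: the exit point
        item_id, _difficulty = min(unused, key=lambda it: abs(it[1] - ability))
        administered.append(item_id)
        if answer(item_id):        # correct -> present a more difficult item
            ability += step
        else:                      # incorrect -> present an easier item
            ability -= step
    return ability, administered
```

A candidate who answers everything correctly is stepped up through progressively harder items; one who answers everything incorrectly is stepped down, which is the behaviour the paragraph above describes.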
Brown and Iwashita (1996) point out that grammar items in particular will vary in difficulty according to the language background of candidates, and they show how a computer-adaptive test of Japanese resulted in very different item difficulties for speakers of English and Chinese. Thus a CAT may also need to take account of the language background of candidates when deciding which items to present, at least in grammar tests, and conceivably also in tests of vocabulary. Chalhoub-Deville and Deville (1999) point out that, despite the apparent advantages of computer-based tests, computer-based testing relies overwhelmingly on selected-response, discrete-point tasks (typically multiple-choice questions) rather than performance-based items, and thus computer-based testing may be restricted to testing linguistic knowledge rather than communicative skills. However, many computer-based tests include tests of reading, which is surely a communicative skill. The question is whether computer-based testing offers any added value over paper-and-pencil reading tests: adaptivity is one possibility, although some test developers are concerned that since reading tests typically present several items on one text — what is known in the jargon as a testlet — they may not be suitable for computer-adaptivity.

This concern about the inherent conservatism of computer-based testing has a long history (see Alderson, 1986a, 1986b, for example), and some claimed innovations, for example, computer-generated cloze and multiple-choice tests (Coniam, 1997, 1998), were actually implemented as early as the 1970s, and were often criticised in the literature for risking the assumption of automatic validity. But recent developments offer some hope. Burstein et al. (1996) argue for the relevance of new technologies in innovation in test design, construction, trialling, delivery, management, scoring, analysis and reporting. They review ways in which new input devices (e.g., voice and handwriting recognition), output devices (e.g., video, virtual reality), software such as authoring tools, and knowledge-based systems for language analysis could be used, and explore advances in the use of new technologies in computer-assisted learning materials. However, as they point out, 'innovations applied to language assessment lag behind their instructional counterparts ... the situation is created in which a relatively rich language presentation is followed by a limited productive assessment' (1996: 245). No doubt this is largely due to the fact that computer-based tests require the computer to score responses. However, Burstein et al. (1996) argue that human-assisted scoring systems could reduce this dependency. (Human-assisted scoring systems are computer-based systems where most scoring of responses is done by computer, but responses that the programs are unable to score are given to humans for grading.) They also give details of free-response scoring tools which are capable of scoring responses up to 15 words long and which correlate highly with human judgements (coefficients of between .89 and .98 are reported). Development of such systems for short-answer questions and for essay questions has since gone on apace.
For example, ETS has developed an automated system for assessing productive language abilities, called e-rater. e-rater uses natural language processing techniques to duplicate the performance of humans rating open-ended essays. Already, the system is used to rate GMAT (Graduate Management Admission Test) essays, and research is ongoing for other programmes, including second/foreign language testing situations. Burstein et al. conclude that 'the barriers to the successful use of technology for language testing are less technical than conceptual' (1996: 253), but progress since that article was published is extremely promising. An example of the use of IT to assess aspects of the speaking ability of second/foreign language learners of English is PhonePass. PhonePass (www.ordinate.org) is delivered over the telephone, and candidates are asked to read texts aloud, repeat heard sentences, say words opposite in meaning to heard words, and give short answers to questions. The system uses speech recognition technology to rate responses, by comparing candidate performance to statistical models of native and non-native performance on the tasks. The system gives a score that reflects a candidate's ability to understand and respond appropriately to decontextualised spoken material, with 40% of the evaluation reflecting the fluency and pronunciation of the responses. Alderson (2000c) reports that reliability coefficients of 0.91 have been found, as well as correlations with the Test of Spoken English (TSE) of 0.88 and with an ILR (Interagency Language Roundtable) Oral Proficiency Interview (OPI) of 0.77. An interesting feature is that the scored sample is retained on a database, classified according to the various scores assigned. This enables users to access the speech sample in order to make their own judgements about the performance for their particular purposes, and to compare how their candidate has performed with other speech samples that have been rated either the same, or higher or lower.

In addition to e-rater and PhonePass, there are a number of promising initiatives in the use of computers in testing. The listening section of the computer-based TOEFL uses photos and graphics to create context and support the content of the mini-lectures, producing stimuli that more closely approximate real-world situations in which people do more than just listen to voices. Moreover, candidates wear headphones, can adjust the volume control, and are allowed to control how soon the next question is presented. One innovation in test method is that candidates are required to select a visual or part of a visual; in some questions candidates must select two choices, usually out of four, and in others candidates are asked to match or order objects or texts. Moreover, candidates see and hear the test questions before the response options appear.
(Interestingly, Ginther (forthcoming) suggests, however, that the use of visuals in the computer-based TOEFL listening test depresses scores somewhat, compared with traditionally delivered tests. More research is clearly needed.) In the Reading section, candidates are required to select a word, phrase, sentence or paragraph in the text itself, and other questions ask candidates to insert a sentence where it fits best. Although these techniques have been used elsewhere in paper-and-pencil tests, one advantage of their computer format is that candidates can see the result of their choice in context before making a final decision. Although these innovations may not seem very exciting, Bennett (1998) claims that the best way to innovate in computer-based testing is first to mount on computer what can already be done in paper-and-pencil format, with possible minor improvements allowed by the medium, in order to ensure that the basic software works well, before innovating in test method and construct. Once the delivery mechanisms work, it is argued, computer-based deliveries can be developed that incorporate desirable innovations.

DIALANG (http://www.dialang.org) is a suite of computer-based diagnostic tests (funded by the European Union) which are available over the Internet, thus capitalising on the advantages of Internet-based delivery (see below). DIALANG uses self-assessment as an integral part of diagnosis. Users' self-ratings are combined with objective test results in order to identify a suitably difficult test for the user. DIALANG gives users feedback immediately, not only on their test scores, but also on the relationship between their test results and their self-assessment.
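The routing step just described, combining users' self-ratings with objective test results to select a suitably difficult test, can be sketched as follows. The equal weighting, the 0-1 scales and the band cut-offs are illustrative assumptions for the sketch, not DIALANG's actual algorithm.

```python
# Hedged sketch of routing a user to an easy, medium or hard test version by
# combining a self-assessment rating with an objective placement score.
# The 0-1 scales, equal weighting and cut-off values are assumptions made
# for illustration; DIALANG's real routing procedure is not specified here.

def route_test(self_rating: float, placement_score: float) -> str:
    """Both inputs are floats in [0, 1]; returns the test version to administer."""
    combined = 0.5 * self_rating + 0.5 * placement_score  # equal weighting (assumed)
    if combined < 0.4:
        return "easy"
    if combined < 0.7:
        return "medium"
    return "hard"
```

The design point is simply that neither source of evidence is trusted alone: an over- or under-confident self-rating is moderated by the objective score, and vice versa.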
DIALANG also gives extensive advice to users on how they can progress from their current level to the next level of language proficiency, basing this advice on the Common European Framework (Council of Europe, 2001). The interface and support language, and the language of self-assessment and of feedback, can be chosen by the test user from a list of 14 European languages. Users can decide which skill or language aspect (reading, writing, listening, grammar and vocabulary) they wish to be tested in, in any one of the same 14 European languages. Currently available test methods consist of multiple-choice, gap-filling and short-answer questions, but DIALANG has already produced CD-based demonstrations of 18 different experimental item types which could be implemented in the future, and the CD demonstrates the use of help, clue, dictionary and multiple-attempt features. Although DIALANG is limited in its ability to assess users' productive language abilities, the experimental item types include a promising combination of self-assessment and benchmarking. Tasks for the elicitation of speaking and writing performances are administered to pilot candidates, and performances are rated by human judges. Those performances on which raters achieve the greatest agreement are chosen as benchmarks. A DIALANG user is presented with the same task and, in the case of a writing task, responds via the keyboard. The user's performance is then presented on screen alongside the pre-rated benchmarks. The user can compare their own performance with the benchmarks. In addition, since the benchmarks are pre-analysed, the user can choose to see raters' comments on various features of the benchmarks, in hypertext form, and consider whether they could produce a similar quality of such features. In the case of speaking tasks, the candidate is simply asked to imagine how they would respond to the task, rather than actually to record their performance.
They are then presented with recorded benchmark performances, and are asked to estimate whether they could do better or worse than each performance. Since the performances are graded, once candidates have assessed themselves against a number of performances, the system can tell them roughly what level their own (imagined) performance is likely to be.

These developments illustrate some of the advantages of computer-based assessment, which make computer-based testing not only more user-friendly, but also more compatible with language pedagogy. However, Alderson (2000c) argues the need for a research agenda which would address the challenge of the opportunities afforded by computer-based testing and the data that can be amassed. Such an agenda would investigate the comparative advantages and added value of each form of assessment — IT-based or not IT-based. This includes issues like the effect of providing immediate feedback, support facilities, second attempts, self-assessment, confidence testing, and the like. Above all, it would seek to throw more light onto the nature of the constructs that can be tested by computer-based testing: 'What is needed above all is research that will reveal more about the validity of the tests, that will enable us to estimate the effects of the test method and delivery medium; research that will provide insights into the processes and strategies test-takers use; studies that will enable the exploration of the constructs that are being measured, or that might be measured ... And we need research into the impact of the use of the technology on learning, on learners and on the curriculum.'