
Computer-based testing

College: College of Education for Human Sciences     Department: English Department     Stage: 4
Course instructor: Munir Ali Khudhair Rabee       1/27/2012 3:43:36 PM


Computer-based testing has witnessed rapid growth
in the past decade and computers are now used to
deliver language tests in many settings. A computer-based
version of the TOEFL was introduced on a
regional basis in the summer of 1998, tests are now
available on CD-ROM, and the Internet is increasingly
used to deliver tests to users. Alderson (1996)
points out that computers have much to offer
language testing: not just for test delivery, but also for
test construction, test compilation, response capture,
test scoring, result calculation and delivery, and test
analysis. They can also, of course, be used for storing
tests and details of candidates.
In short, computers can be used at all stages in the
test development and administration process. Most
work reported in the literature, however, concerns
the compilation, delivery and scoring of tests by
computer. Fulcher (1999b) describes the delivery of
an English language placement test over the Web and
Gervais (1997) reports the mixed results of transferring
a diagnostic paper-and-pencil test to the computer.
Such articles set the scene for studies of
computer-based testing which compare the accuracy
of the computer-based test with a traditional paper-and-pencil
test, addressing the advantages of a computer-delivered
test in terms of accessibility and
speed of results, and possible disadvantages in terms
of bias against those with no computer familiarity, or
with negative attitudes to computers.
This concern with bias is a recurrent theme in the
literature, and it inspired a large-scale study by the
Educational Testing Service (ETS), the developers of
the computer-based version of the TOEFL, who
needed to show that such a test would not be biased
against those with no computer literacy. Jamieson et
al. (1998) describe the development of a computer-based
tutorial intended to train examinees to take
the computerised TOEFL. Taylor et al. (1999) examine
the relationship between computer familiarity
and TOEFL scores, showing that those with high
computer familiarity tend to score higher on the
traditional TOEFL. They compare examinees with
high and low computer familiarity in terms of their
performance on the computer tutorial and on computerised
TOEFL-like tasks. They claim that no relationship
was found between computer familiarity
and performance on the computerised tasks after
controlling for English language proficiency. They
conclude that there is no evidence of bias against
candidates with low computer familiarity, but also

take comfort in the fact that all candidates will be
able to take the computer tutorial before taking an
operational computer-based TOEFL.
The commonest use of computers in language
testing is to deliver tests adaptively (e.g., Young et al.,
1996). This means that the computer adjusts the
items to be delivered to a candidate in the light of
that candidate's success or failure on previous items.
If the candidate fails a difficult item, s/he is presented
with an easier item, and if s/he gets an item correct,
s/he is presented with a more difficult item. This has
advantages: firstly, candidates are presented with
items at their level of ability, and are not faced with
items that are either too easy or too difficult, and secondly,
computer-adaptive tests (CATs) are typically
quicker to deliver, and security is less of a problem
since different candidates are presented with different
items. Many authors discuss the advantages of CATs
(Laurier, 1998; Brown, 1997; Chalhoub-Deville &
Deville, 1999; Dunkel, 1999), but they also emphasise
issues that test developers and score users must
address when developing or using CATs. When
designing such tests, developers have to take a number
of decisions: what should the entry level be, and
how is this best determined for any given population?
At what point should testing cease (the so-called
'exit point') and what should the criteria be
that determine this? How can content balance best
be assured in tests where the main principle for
adaptation is psychometric? What are the consequences
of not allowing users to skip items, and can
these consequences be ameliorated? How can developers ensure
that some items are not presented much more frequently
than others (item exposure), because of their
facility, or their content? Brown and Iwashita (1996)
point out that grammar items in particular will vary
in difficulty according to the language background
of candidates, and they show how a computer-adaptive
test of Japanese resulted in very different item
difficulties for speakers of English and Chinese. Thus
a CAT may also need to take account of the language
background of candidates when deciding
which items to present, at least in grammar tests, and
conceivably also in tests of vocabulary.
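The adaptive procedure described above (a harder item after a correct answer, an easier one after a wrong answer) can be sketched as follows. This is a minimal illustration under stated assumptions: operational CATs estimate ability with item response theory, and the item bank, the step size and the fixed-length exit rule below are invented for the sketch, not taken from any real test.

```python
# Staircase sketch of computer-adaptive item selection: after a
# correct answer the next item is harder, after a wrong answer it is
# easier. All names and parameters here are illustrative assumptions.

def next_item(item_bank, ability, administered):
    """Pick the unseen item whose difficulty is closest to the
    current ability estimate."""
    candidates = [i for i in item_bank if i["id"] not in administered]
    return min(candidates, key=lambda i: abs(i["difficulty"] - ability))

def run_cat(item_bank, answer_fn, entry_level=0.0, max_items=5, step=0.5):
    """Administer up to max_items adaptively and return the final
    ability estimate (the exit criterion here is just test length)."""
    ability = entry_level          # the 'entry level' design decision
    administered = set()
    for _ in range(max_items):
        item = next_item(item_bank, ability, administered)
        administered.add(item["id"])
        correct = answer_fn(item)
        ability += step if correct else -step   # harder or easier next
    return ability

bank = [{"id": n, "difficulty": d}
        for n, d in enumerate([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])]

# Simulated candidate who answers correctly whenever the item is
# at or below difficulty 1.0:
estimate = run_cat(bank, lambda item: item["difficulty"] <= 1.0)
```

Even this toy version surfaces the design decisions listed above: `entry_level` sets where testing starts, `max_items` is a crude exit rule, and the `administered` set prevents repeats within a test but does nothing about item exposure across candidates.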
Chalhoub-Deville and Deville (1999) point out
that, despite the apparent advantages of computer-based
tests, computer-based testing relies overwhelmingly
on selected-response (typically multiple-choice
questions) discrete-point tasks rather than
performance-based items, and thus computer-based
testing may be restricted to testing linguistic knowledge
rather than communicative skills. However,
many computer-based tests include tests of reading,
which is surely a communicative skill. The question
is whether computer-based testing offers any added
value over paper-and-pencil reading tests: adaptivity
is one possibility, although some test developers are
concerned that since reading tests typically present
several items on one text — what is known in the
jargon as a 'testlet' — they may not be suitable for
computer-adaptivity. This concern for the inherent
conservatism of computer-based testing has a long
history (see Alderson, 1986a, 1986b, for example), and
some claimed innovations, for example, computer-generated
cloze and multiple-choice tests (Coniam,
1997, 1998) were actually implemented as early as
the 1970s, and were often criticised in the literature
for risking the assumption of automatic validity. But
recent developments offer some hope. Burstein et al.
(1996) argue for the relevance of new technologies
in innovation in test design, construction, trialling,
delivery, management, scoring, analysis and reporting.
They review ways in which new input devices
(e.g., voice and handwriting recognition), output
devices (e.g., video, virtual reality), software such as
authoring tools, and knowledge-based systems for
language analysis could be used, and explore
advances in the use of new technologies in computer-
assisted learning materials. However, as they point
out, 'innovations applied to language assessment lag
behind their instructional counterparts ... the situation
is created in which a relatively rich language
presentation is followed by a limited productive
assessment' (1996: 245).
No doubt, this is largely due to the fact that computer-
based tests require the computer to score
responses. However, Burstein et al. (1996) argue that
human-assisted scoring systems could reduce this
dependency. (Human-assisted scoring systems are
computer-based systems where most scoring of
responses is done by computer but responses that the
programs are unable to score are given to humans for
grading.) They also give details of free-response scoring
tools which are capable of scoring responses up
to 15 words long which correlate highly with human
judgements (coefficients of between .89 and .98 are
reported). Development of such systems for short-answer
questions and for essay questions has since
gone on apace. For example, ETS has developed an
automated system for assessing productive language
abilities, called e-rater. E-rater uses natural language
processing techniques to duplicate the performance
of humans rating open-ended essays. Already, the
system is used to rate GMAT (Graduate Management
Admission Test) essays and research is ongoing
for other programmes, including second/foreign
language testing situations. Burstein et al. conclude
that 'the barriers to the successful use of technology
for language testing are less technical than conceptual'
(1996: 253), but progress since that article was
published is extremely promising.
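The agreement figures quoted above (coefficients between .89 and .98) are correlations between machine-assigned and human-assigned scores. A minimal sketch of how such agreement is checked, using a hand-written Pearson correlation on invented scores (the data below is illustrative only, not from the studies cited):

```python
# Pearson correlation between hypothetical automated scores and
# hypothetical human rater scores. The score lists are invented for
# illustration; high agreement gives a coefficient close to 1.

from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

machine = [3, 4, 2, 5, 4, 1]   # hypothetical automated scores
human   = [3, 4, 3, 5, 4, 1]   # hypothetical rater scores

r = pearson(machine, human)    # close agreement, so r is near 1
```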
An example of the use of IT to assess aspects of
the speaking ability of second/foreign language
learners of English is PhonePass. PhonePass (www.
ordinate.org) is delivered over the telephone, and
candidates are asked to read texts aloud, repeat heard
sentences, say words opposite in meaning to heard
words, and give short answers to questions. The system
uses speech recognition technology to rate
responses, by comparing candidate performance to
statistical models of native and non-native performance
on the tasks. The system gives a score that
reflects a candidate's ability to understand and
respond appropriately to decontextualised spoken
material, with 40% of the evaluation reflecting the
fluency and pronunciation of the responses. Alderson
(2000c) reports that reliability coefficients of 0.91
have been found as well as correlations with the Test
of Spoken English (TSE) of 0.88 and with an ILR
(Inter-agency Language Roundtable) Oral
Proficiency Interview (OPI) of 0.77. An interesting
feature is that the scored sample is retained on a database,
classified according to the various scores
assigned. This enables users to access the speech sample,
in order to make their own judgements about
the performance for their particular purposes, and to
compare how their candidate has performed with
other speech samples that have been rated either the
same, or higher or lower.
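The 40% weighting mentioned above can be illustrated as a simple weighted combination of subscores. The subscore names and the treatment of the remaining 60% are assumptions made for this sketch, not Ordinate's published scoring model:

```python
# Sketch of a weighted score combination in which delivery (fluency
# and pronunciation) carries 40% of the overall evaluation, as the
# passage describes. Subscore names and scales are assumptions.

def overall_score(response_accuracy, fluency, pronunciation):
    """Combine subscores (each on a 0-1 scale): 60% for understanding
    and responding appropriately, 40% for delivery."""
    delivery = (fluency + pronunciation) / 2
    return 0.6 * response_accuracy + 0.4 * delivery

score = overall_score(response_accuracy=0.9, fluency=0.8, pronunciation=0.7)
```

With those example subscores the combination works out to roughly 0.84, dominated by the accuracy component as the 60/40 split intends.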
In addition to e-rater and PhonePass there are a
number of promising initiatives in the use of computers
in testing. The listening section of the computer-
based TOEFL uses photos and graphics to
create context and support the content of the mini-lectures,
producing stimuli that more closely approximate
real world situations in which people do more
than just listen to voices. Moreover, candidates wear
headphones, can adjust the volume control, and are
allowed to control how soon the next question is
presented. One innovation in test method is that
candidates are required to select a visual or part of a
visual; in some questions candidates must select two
choices, usually out of four, and in others candidates
are asked to match or order objects or texts.
Moreover, candidates see and hear the test questions
before the response options appear. (Interestingly,
Ginther, forthcoming, suggests, however, that the use
of visuals in the computer-based TOEFL listening
test depresses scores somewhat, compared with traditionally
delivered tests. More research is clearly needed.)
In the Reading section candidates are required to
select a word, phrase, sentence or paragraph in the
text itself, and other questions ask candidates to
insert a sentence where it fits best. Although these
techniques have been used elsewhere in paper-and-pencil
tests, one advantage of their computer format
is that the candidate can see the result of their choice
in context, before making a final decision. Although
these innovations may not seem very exciting,
Bennett (1998) claims that the best way to innovate
in computer-based testing is first to mount on computer
what can already be done in paper-and-pencil
format, with possible minor improvements allowed
by the medium, in order to ensure that the basic software
works well, before innovating in test method
and construct. Once the delivery mechanisms work,
it is argued, computer-based deliveries can be
developed that incorporate desirable innovations.
DIALANG (http://www.dialang.org) is a suite
of computer-based diagnostic tests (funded by the
European Union) which are available over the Internet,
thus capitalising on the advantages of Internet-based
delivery (see below). DIALANG uses self-assessment
as an integral part of diagnosis. Users'
self-ratings are combined with objective test results
in order to identify a suitably difficult test for the
user. DIALANG gives users feedback immediately,
not only on their test scores, but also on the relationship
between their test results and their self-assessment.
DIALANG also gives extensive advice to users
on how they can progress from their current level to
the next level of language proficiency, basing this
advice on the Common European Framework
(Council of Europe, 2001). The interface and support
language, and the language of self-assessment and of
feedback, can be chosen by the test user from a list of
14 European languages. Users can decide which skill
or language aspect (reading, writing, listening, grammar
and vocabulary) they wish to be tested in, in any
one of the same 14 European languages. Currently
available test methods consist of multiple-choice, gap-filling
and short-answer questions, but DIALANG
has already produced CD-based demonstrations of 18
different experimental item types which could be
implemented in the future, and the CD demonstrates
the use of help, clue, dictionary and multiple-attempt
features.
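The routing step described above, combining a user's self-rating with an objective result to identify a suitably difficult test, can be sketched as follows. The three difficulty bands, the 1-6 self-rating scale and the simple averaging rule are assumptions for illustration, not DIALANG's actual placement algorithm:

```python
# Sketch of routing a user to a test of suitable difficulty by
# combining a self-rating with an objective placement score. Bands,
# scales and the averaging rule are illustrative assumptions.

def choose_test_level(self_rating, placement_score):
    """Map a self-rating (1-6, echoing the six CEFR levels) and a
    placement score (0-100) onto one of three difficulty bands."""
    # Rescale the placement score to the same 1-6 range, then average
    # it with the self-rating so both sources of evidence count.
    combined = (self_rating + placement_score / 100 * 5 + 1) / 2
    if combined < 2.5:
        return "easy"
    if combined < 4.5:
        return "intermediate"
    return "hard"

level = choose_test_level(self_rating=4, placement_score=70)
```

Because both inputs contribute, a user who over- or under-rates themselves is still pulled toward the band their objective score supports, which is the point of combining the two.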
Although DIALANG is limited in its ability to
assess users' productive language abilities, the experimental
item types include a promising combination
of self-assessment and benchmarking. Tasks for the
elicitation of speaking and writing performances are
administered to pilot candidates and performances are
rated by human judges. Those performances on which
raters achieve the greatest agreement are chosen as
'benchmarks'. A DIALANG user is presented with the
same task, and, in the case of a writing task, responds
via the keyboard. The user's performance is then presented
on screen alongside the pre-rated benchmarks.
The user can compare their own performance with
the benchmarks. In addition, since the benchmarks are
pre-analysed, the user can choose to see raters' comments
on various features of the benchmarks, in
hypertext form, and consider whether they could produce
a similar quality of such features. In the case of
Speaking tasks, the candidate is simply asked to imagine
how they would respond to the task, rather than
actually to record their performance. They are then
presented with recorded benchmark performances,
and are asked to estimate whether they could do better
or worse than each performance. Since the performances
are graded, once candidates have self-assessed
themselves against a number of performances, the system
can tell them roughly what level their own (imagined)
performance is likely to be.
These developments illustrate some of the advantages
of computer-based assessment, which make
computer-based testing not only more user-friendly,
but also more compatible with language pedagogy.
However, Alderson (2000c) argues the need for a
research agenda, which would address the challenge
of the opportunities afforded by computer-based
testing and the data that can be amassed. Such an
agenda would investigate the comparative advantages
and added value of each form of assessment — IT-based
or not IT-based. This includes issues like the
effect of providing immediate feedback, support
facilities, second attempts, self-assessment, confidence
testing, and the like. Above all, it would seek to throw
more light onto the nature of the constructs that can
be tested by computer-based testing:
What is needed above all is research that will reveal more about
the validity of the tests, that will enable us to estimate the effects
of the test method and delivery medium; research that will provide
insights into the processes and strategies test-takers use;
studies that will enable the exploration of the constructs that are
being measured, or that might be measured ... And we need
research into the impact of the use of the technology on learning,
on learners and on the curriculum.


