Revising an Exam for the Purpose of Properly Inferring Socio-Linguistic Competencies in a Secondary School English Conversation Class
Justification
The principal reason I wish to revise this exam is that at the time of its creation, prior to the start of the M.Phil program, I had complete creative control over both the curriculum and the assessment of the course. Revisiting this exam allows me to return to the pedagogical views I held at the time and to examine them critically in light of the content of the Language Assessment module. I believe this critical analysis of my past work will give me the space needed to reexamine the assumptions I held about what makes a good assessment in light of the information I now hold. I hope this will lead to a synthesis of what I learned through my classroom experience and the theoretical groundwork the module provided. Furthermore, I still find this type of assessment highly relevant for students living in democratic societies. The ability of voting-age learners to autonomously and critically develop, express, and vocalize their political views is something scholars have argued to be necessary for the survival of democracy (Little, 2004).
Learners
This test was originally created for a group of five secondary school students attending a school in Aomori, Japan. This iteration of the course consisted solely of female students, all of whom were 18 years old at the time of the test's administration and were planning to pursue further education at the tertiary level. Each student had been studying English since entering junior high school six years earlier. As the class was titled "English Conversation", the learning goals of these students were oriented primarily around the development of oral production and aural reception skills. They all shared an interest in debate and in discourse surrounding the political and social issues of the day, an important piece of biographical information with regard to the appropriateness of the test's content. In terms of learner goals, these students were all aiming to enter four-year university programs, with intended majors across a variety of fields in the liberal arts.
Course of Study
All students who took this test were enrolled in an advanced-level English conversation class which met for two to three fifty-minute classes per week over the course of an academic year. As this course was explicitly billed as a conversation class, my design philosophy for the syllabus emphasized spoken production, auditory reception, and socio-linguistic competencies, often at the expense of written interaction and reading comprehension; these latter domains were the focus of the students' compulsory English class. In addition to the unit described in this test cycle, the curriculum featured a great deal of task-based language learning, covering scenarios ranging from job interviews to international exchange projects.
Purpose of the Assessment and Test Design
There are two overall aims of this assignment and the coursework attached to it. The first is to develop the students' ability to produce logically consistent arguments justifying their views on a number of political issues of the day. The second is to develop their ability to express these views in English. These two aims are weighted equally in the overall score. The class made extensive use of content-based learning, and the assessments reflect this. The framework under which the assignment operates is borrowed from Swain (Swain, 2001, cited in Green, 2014), who argues for paired speaking tasks in which both parties share turn-taking responsibility. This reflects the natural flow of the class and, as Green argues (Green, 2014, p. 140), can lead to positive washback by encouraging classroom interaction. The paired work consists of a Socratic-style dialogue between assessor and assessed, in which the instructor asks questions intended to draw out concise, syllogistic answers and refutations from students regarding their views.
The assessment seeks to infer the degree of competence of these students in two distinct domains: the first being their linguistic competence in spoken English, and the second their ability to rationally state, justify, and defend their point of view on the political issues of the upcoming election. It is important to note that this assessment does not grade the content of these views, only the ability to express them lucidly and defend them logically.
The test content was derived from a website published by the civic advocacy group みなの未来を選ぶためのチェックリスト (Everyone's Checklist for Choosing Their Future, 2023). The website outlines the main issues facing the electorate in the lead-up to the 2021 general election. As part of the testing cycle, the course examined these issues, acquired the vocabulary needed to describe the Japanese political system in English, and developed the skills needed to articulate positions on these issues. In the examination, the students' task is to articulate, through oral production, a persuasive argument about where they stand on these issues, up to and including why they believe the opposing view is wrong. The students must therefore defend their own view while simultaneously critiquing the opposing one.
Practicality
Practicality can be understood as "the difference between the resources that will be required in the development and use of an assessment and the resources that will be available for these activities" (Bachman & Palmer, 2010, p. 262, cited in Green, 2014). In essence, practicality concerns the ability to design and implement the assessment in question effectively. An assessment with a high degree of practicality will run smoothly from development to implementation, with few external factors affecting the experience of test takers in either a positive or negative direction.
Especially relevant for a test which took place in an educational context outside those I am accustomed to is what Buck calls "system constraints" (Buck, 2009, cited in Green, 2014): the cultural, political, and workplace factors that can hinder a test from being implemented with a high degree of practicality. Although the Japanese education system can at times foster a strong inclination towards teacher-centered curricula, which can inhibit the autonomous communicative potential of students (Dias, 2010), this was counteracted by the institutional support I received in implementing this style of course at my school.
It is arguable that the revised exam has a high degree of practicality, as the test content was easy to adapt from the news media, which had the added benefit of making the assessment material highly relevant to the learners. In addition, the format of the test requires no special materials, and the need for physical space is limited. For these reasons, I feel the systemic forces actually encouraged this sort of assessment rather than constrained it.
Reliability
Following Green's (Green, 2014, p. 73) seven-step program for designing reliable assessments, the following methods were implemented to increase the reliability of this assessment. Students were made aware of the test content before the examination, so there were no surprises regarding what they had to know. Although the variety of tasks in the speaking portion is limited, the students have the flexibility of choosing which questions they wish to be graded on. This may also lead to positive washback, as students are more likely to engage with the issues that interest them most. While phonological and socio-cultural factors are accounted for, the limited scope and the focus on content allow the students to concentrate on the main part of the assessment rather than studying in a manner that encourages box ticking. Conditions are kept as consistent as possible between individual students, as the assessment takes place in the same room and within a fifty-minute time frame. The assessment features a controlled and detailed scoring rubric, available to the students beforehand, which clearly articulates the expectations for the assessment. It is important to note that if this assessment were implemented in a real situation, the rubric would be translated into the students' first language to increase transparency and reduce grading ambiguity. While Green's sixth and seventh points are valid methods of increasing reliability, it is not always possible to have more than one rater, especially in a public school setting where doing so would mean another teacher taking on more work, and it was not possible to increase the size of my class to enlarge the population of test takers.
Test reliability can also be increased by accounting for a separate group of factors based on O'Sullivan and Green's test taker characteristics framework (O'Sullivan & Green, 2011, cited in Green, 2014). The cohort of students taking this test share much more in common than not. While psychological and physiological variables can never be exactly the same, all students were the same age and all appeared motivated, with a high degree of openness to new experiences. In addition, their life experiences share much in common, at least in their formal education and lifestyle, as the students have lived in the same area and attended the same schools since early childhood.
Validity
As Messick argues (Messick, 1989, cited in Green, 2014), the two greatest threats to assessment validity are construct irrelevance and construct underrepresentation. The threat of construct irrelevance can be addressed by looking at the role of the questions in this assessment. The questions were the issues of the election cycle, meaning that assessing the students' ability to understand, express, and defend their views simultaneously develops their ability to critically analyze the issues facing their society. In essence, there is almost complete overlap between the test criterion and the test constructs. As this course was explicitly designed to focus on aural and oral receptive and productive skills, construct underrepresentation is accounted for by the fact that the assessment addresses only the skills on which the overarching course focuses; there are no surprise writing or reading sections on this exam. Content validity is accounted for by the same set of reasons as construct irrelevance: the goal for these students is to develop their own analytical skills in contemporary political discourse, which the questions reflect. As noted by Green (Green, 2014, p. 79), face validity can be addressed by gathering judgements from stakeholders who are not experts in language testing. As this exam was the culmination of one of the course's exam cycles, the students and I had ample time to reflect on these issues, and as non-experts in testing they felt that the questions they were asked to develop answers for were accurate reflections of the election issues. In addition, the questions themselves were derived from a non-profit voter advocacy group, another non-expert source supporting face validity.
As a communicative language test, the perception of this test as having a high degree of construct validity can be justified through Hymes's theory of communicative competence. In particular, the shift from a "psychological perspective on language, which sees language as an internal phenomenon, to a sociological one, focusing on the external, social functions of language" (McNamara, 2000, p. 17) is especially relevant. Although the opportunity for serious political discourse in English is admittedly very limited for these students, the goal of this exam is, more than anything else and as Holec states (Holec, 1981, p. 3, cited in Little, 2004), to move each learner from being a "product of his society" to a "producer of his society".
Rating Procedures and Scales
As Green (Green, 2014, p. 130) notes, spoken interactions often feature comparatively simple grammatical structures and vocabulary ranges compared with their written counterparts. This is important when developing a scoring rubric, especially for high school students at around a B1 level. Following the advice of Fulcher (Fulcher, 2003, cited in Green, 2014), it is wise from the outset to prioritize either fluency or accuracy. The overall test design and philosophy fit well with a system that prioritizes fluency in this situation. Given that the students are at most at a B1 level and are 18 years old, I believe it preferable for them to aim for a lucid transmission of ideas rather than an overly academic level of precision, since they still have far to go both in their language learning and in developing the philosophical underpinnings of their political views. In addition, an emphasis on accuracy may create a testing environment that encourages conservative, risk-averse use of language: the students may feel it is better to play it safe and use simple language to avoid being penalized, instead of aiming to expand their communicative potential. As mentioned by McNamara (McNamara, 2000, p. 64), criterion-referenced measurement is a justifiable position to take when the examiners are not interested in norm-referencing the results. Since the class consists of only five students, there is little justification for norm-referenced measurement on this assessment, especially when the exam content is as personalized as it is.
As McNamara (McNamara, 1996, cited in Green, 2014) argues, how a test is scored is an indicator of its entire theoretical basis. I believe this to be accurate, and I therefore designed the rating scales to reflect the purpose of the assessment. Due to the niche nature of this exam, it was necessary to develop a task-specific rating scale, as the exam ultimately aims to test both the coherence of the students' arguments and their linguistic competency. These reasons also lend themselves to the creation of an analytic rating scale: students should have the opportunity to receive marks that reflect one aspect of their competency even if others are lacking. The relative importance of the different criteria is reflected in the weight each is given in the overall exam score. Each criterion is then rated against a banded descriptor scale based on the descriptors found in the CEFR Companion Volume (Council of Europe, 2020).
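The weighting logic of the analytic scale can be made concrete in a short sketch. The code below is purely illustrative: the criterion names and maxima mirror the rubric in Index B, but the sample marks are hypothetical, not real student data.

```python
# Sketch of how the analytic marks aggregate to the overall /100 score.
# Criterion weights mirror the rubric in Index B; two criteria (Putting a
# Case, Fluency) carry 15 points, the rest 10, reflecting their relative
# importance in the overall score.

CRITERIA = {
    "Putting a Case": 15,
    "Information Exchange": 10,
    "Linking to Previous Knowledge": 10,
    "Grammatical Accuracy": 10,
    "Phonological Control": 10,
    "Thematic Development": 10,
    "Propositional Precision": 10,
    "Vocabulary Range": 10,
    "Fluency": 15,
}

def overall_score(marks: dict) -> int:
    """Sum the analytic marks, checking each stays within its criterion's maximum."""
    total = 0
    for criterion, maximum in CRITERIA.items():
        mark = marks[criterion]
        if not 0 <= mark <= maximum:
            raise ValueError(f"{criterion}: {mark} is outside 0-{maximum}")
        total += mark
    return total

# The weights must sum to the 100-point total declared on the score sheet.
assert sum(CRITERIA.values()) == 100

# Hypothetical marks for one test taker.
sample = {
    "Putting a Case": 12, "Information Exchange": 7,
    "Linking to Previous Knowledge": 6, "Grammatical Accuracy": 8,
    "Phonological Control": 7, "Thematic Development": 6,
    "Propositional Precision": 7, "Vocabulary Range": 8, "Fluency": 11,
}
print(overall_score(sample))  # 72
```

Keeping each criterion's mark separate until the final summation preserves the analytic principle discussed above: a student's strength in one area is recorded even where another area is weak.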
Conclusion
As previously stated, this exam aims to infer competencies in two distinct forms of knowledge: linguistic knowledge and social knowledge. The exam discerns this by rating the syllogistic spoken English productions of the students in the course. As the world continues its march towards ever greater globalization and connection, the ability to maintain a discourse about the threats our planet and species face will only grow in importance. Through assessments such as the one outlined above, it is my hope that language education can continue to perform its critical role in increasing the mediation capabilities of the next generations of language learners.
Index A - Sample of the Exam
第三部　下の質問から四つを選んで答えなさい。答えの中で自分の意見を守り、反対意見にも反論しましょう。
1. Should Japan use nuclear power and build new nuclear power plants?
2. Increase in sales tax
3. Should Japan increase military spending?
4. Same Sex Marriage
5. Article 9 Reform
6. Should women who are married be able to keep their last names?
7. An issue you are concerned about
Part 3. Answer four of the questions below. Consider why you are right and why the other side is wrong.
1. Should Japan use nuclear power and build new nuclear power plants?
2. Increase in sales tax
3. Should Japan increase military spending?
4. Same Sex Marriage
5. Article 9 Reform
6. Should women who are married be able to keep their last names?
7. An issue you are concerned about
Index B - Scoring Rubric (all ratings based on the CEFR Companion Volume; Council of Europe, 2020)
Putting a Case (15 points)
Excellent (11-15): Can develop their position on the issue to a degree which permits ease of flow, and can explain why the opposing view is flawed.
Competent (5-10): Can state their position on the topics, but lacks a coherent narrative structure; can only refute the opposing view with simple subjective statements.
Needs improvement (0-4): Can state their position on issues, but without clear reasoning, and does not attempt to address the other point of view.

Information Exchange (10 points)
Excellent (7-10): Can exchange and confirm factual information in order to support their position.
Competent (4-6): Can recount details to support their view, but lacks appropriate composure.
Needs improvement (0-3): Can state basic reasons to support their view, but lacks the ability to connect their evidence to their claim.

Linking to Previous Knowledge (10 points)
Excellent (7-10): Can show how new information can be linked with previous information with a degree of finesse.
Competent (4-6): Can attempt to link information, but the attempt results in vagueness in the connection.
Needs improvement (0-3): Can state a connection between two pieces of information, but lacks the ability to construct a fluid link.

Grammatical Accuracy (10 points)
Excellent (7-10): Can communicate with a degree of accuracy and control, even if the influence of the mother tongue is present.
Competent (4-6): Can communicate ideas semi-fluently, even if inaccuracies cloud full understanding.
Needs improvement (0-3): Can operate with a degree of control that allows a transmission of ideas, but inaccuracies inhibit full understanding.

Phonological Control (10 points)
Excellent (7-10): Can transmit information in a manner which is not heavily impacted by a lack of phonological control, though errors may be present.
Competent (4-6): Can transmit information, though phonological errors make the argument difficult to follow at times.
Needs improvement (0-3): Can state their views, but overall phonological control hampers the fluency and ease of communication.

Thematic Development (10 points)
Excellent (7-10): Can develop their positions in a logically consistent manner.
Competent (4-6): Can justify their positions; lapses in logical coherence occur but do not affect overall meaning.
Needs improvement (0-3): Can justify their positions, but their argument lacks a logical coherence, which affects overall meaning.

Propositional Precision (10 points)
Excellent (7-10): Can explain the main idea of their positions with a degree of precision.
Competent (4-6): Can explain the main idea of their positions, but their attempt leaves hints of ambiguity.
Needs improvement (0-3): Can attempt to explain their ideas, but their attempt leaves a large degree of ambiguity.

Vocabulary Range (10 points)
Excellent (7-10): Can use a good range of vocabulary to augment the precision of their argument.
Competent (4-6): Can use a range of vocabulary of an appropriate register.
Needs improvement (0-3): Can only use less precise vocabulary, which subtracts from their overall cohesion.

Fluency (15 points)
Excellent (11-15): Can express themselves with a degree of ease, allowing for some breaks in the overall flow.
Competent (5-10): Can express their views, but often with several moments of repair.
Needs improvement (0-4): Can keep the flow of information at a comprehensible level, though large breaks subtract from overall cohesion.
Overall score ______ / 100
References
Bachman, L. F. and Palmer, A. S. (2010). Language Assessment in Practice. Oxford, UK: Oxford University Press.
Buck, G. (2009). 'Challenges and constraints in language test development', in J. C. Alderson (ed.) The Politics of Language Education: Individuals and Institutions (pp. 166-184). Bristol, UK: Multilingual Matters.
Council of Europe (2020). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Companion Volume. Strasbourg: Council of Europe Publishing.
Dias, J. (2010). Learner autonomy in Japan: transforming 'help yourself' from threat to invitation. Computer Assisted Language Learning, 13(1), 49-64. https://doi.org/10.1076/0958-8221(200002)13:1;1-k
みなの未来を選ぶためのチェックリスト [Everyone's Checklist for Choosing Their Future] (2023). Retrieved 24 April 2023 from https://choiceisyours2021.jp/
Fulcher, G. (2003). Testing Second Language Speaking. London, UK: Longman.
Green, A. (2014). Exploring Language Assessment and Testing. London, UK: Routledge.
Holec, H. (1981). Autonomy and Foreign Language Learning. Oxford, UK: Pergamon.
Little, D. (2004). Learner autonomy: drawing together the threads of self-assessment, goal-setting and reflection. European Centre for Modern Languages.
McNamara, T. (1996). Measuring Second Language Performance. London, UK: Longman.
McNamara, T. (2000). Language Testing. Oxford, UK: Oxford University Press.
Messick, S. (1989). 'Validity', in R. L. Linn (ed.) Educational Measurement (3rd edition) (pp. 13-103). New York, NY: Macmillan.
O'Sullivan, B. and Green, A. B. (2011). 'Test taker characteristics', in L. Taylor (ed.) Examining Speaking: Research and Practice in Assessing Second Language Speaking. Cambridge, UK: Cambridge University Press and Cambridge ESOL.
Swain, M. (2001). Examining dialogue: another approach to content specification and to validating inferences drawn from test scores. Language Testing, 18(3), 275-302. https://doi.org/10.1177/026553220101800302