Using Automatic Speech Recognition to Train English Language Learners in Solfege: A Task Based Learner’s Tool
Overview
A core part of university music programs is ear training and sight singing. Students in these classes must transcribe music they hear onto paper and sing notated music at sight. This often involves the use of solfege to represent the relationship between the tonic and the other notes of the corresponding scale.
In addition to these musical demands, English language learners (ELLs) face the further challenge of producing the phonemes that the English-language version of solfege requires. As many students are likely to be familiar with solfege from having studied it in their L1s, the probability of negative language transfer is high. This may be an issue because their native language has no equivalent to a certain English solfege sound, as with the Japanese シ “shi” for the major 7th where English uses “ti”, or as in Korean, where the lack of an /f/ or /ɸ/ phoneme results in the perfect 4th being represented by 파 /pa/ instead of the /fa/ found in English. There are also cases where the phonetic representations of the same interval are similar, but distinct enough to prevent a full sense of unity in choral music, as shown by the difference between the /ɸ/ of Japanese solfege and the /f/ of English solfege in the perfect fourth. As a consequence, students may receive lower than expected marks on aural skills examinations when they use an incorrect phoneme to represent a note despite full musical accuracy, and in a choral ensemble their pronunciation may prevent them from fully locking in with the group.
Outside of academia, the need for phonetic accuracy is also present on the bandstand. A personal anecdote illustrates the need for a system that can help non-native English speakers improve their phonetic accuracy when singing in English. In the summer of 2019, a band I was playing in performed at a jazz festival in Japan. Another band on the bill that day featured an ELL singer who sang a tune of her own called “Love me Gently”. Although she was an excellent singer, her difficulty with the English /l/ and /v/ (both are absent from Japanese phonology) resulted in the word “love” being pronounced like the English word “rub”. To the ears of those familiar with these distinctions, this had the unfortunate effect of completely changing the mood of the somber ballad. It is my belief that the theoretical program described below would help singers such as this one improve their ability to sing in a non-native language.
Introduction
The concept of using CALL programs to help students master the phonetic system of a given language has a long history (Holmes, 1998; Lambacher, 2010; Tsubota, 2004; Engwall, 2011; Łodzikowski, 2021; Saito, 2007; Evers & Chen, 2020). Pennington (2010) offers one of the stronger justifications for CALL-based phonetic training. In the article, Pennington notes how post-critical-period second language learners often face diminishing returns in their language learning once they reach the intermediate level. Computer-assisted pronunciation (CAP) programs can deal with this issue, as also pointed out by Tsai (2019), by increasing the autonomy of language learners.
Pennington (2010) also lays out a ten-point, step-by-step guide to building a CAP program; each point is addressed in turn below.
Figure 1
Suggestions for Improving CAP Pedagogy (Pennington, 2010)
1. Without a theoretical basis to work from, most CAP programs fall into a type of “low level performance phenomenon” (Pennington, 2010). She makes an excellent point about moving away from a segmental view of pronunciation towards a more prosodic phonology. Although my system primarily deals with segmental solfege phonemes, Pennington’s points on intonation, stress, and rhythm are certainly relevant. As will be discussed below, the role of intonation in a system which is also grading for musical accuracy requires acknowledging the nature of vocal music performance.
2. Establishing a baseline for pronunciation is of fundamental importance to any CAP program, as all iterations of the task will be judged against it. Standard “North American” and “UK” accent models would be a start, but including more would be ideal. As Thomson (2012) shows, learners benefit from perceptual training tasks which feature a range of different voices. This shows the importance not only of having a wide range of English accent models to work with, but also of having both male and female voice models, so that learners can tailor the program's output and evaluation system to their own voice while accounting for the differences between typical male and female vocal ranges and timbres.
3. Due to the rather niche nature of the program’s target audience, the goal is clear: users should be able to produce both accurate pitch based on the Western tonal music system and consonant and vowel phonemes in line with the English solfege system. Although Pennington cautions against designing a system based on the production of individual sounds, this may be an appropriate task given the larger context in which this program is situated. For more advanced learners, this concern can be addressed by designing tasks based on the singing of melodies. This would allow for a large degree of difficulty scaling if the tested melodies range from simple nursery melodies to the more complex melodies of jazz standards.
4. In terms of specific targets, it is important to build in leeway for how the human ear hears and comprehends pitch and phonetic input. The human ear doesn’t require the note A to be sung at exactly 440 Hz to sound in tune (barring the small minority of the population with perfect pitch), and the pitch test should allow for an acceptable range of pitch to reflect this. On the phonetic side of the program, the judgments should likewise reflect the natural variation between different speakers of the same language and accent, and shouldn’t demand that the user sound exactly like the target model. This can be further addressed by allowing the user to adjust both the phonetic and pitch correctness thresholds, so that both heavily L2-accented users and weaker singers can gradually build their competencies by progressively making the task more difficult.
5. Following Pennington’s advice, I believe there are a number of additions to the main task which could benefit a user. A pre-production task would be an SDT (sound discrimination task) in which the user is primed before the start of the task by listening to and identifying their L1’s and English’s versions of the solfege syllable that they are going to sing. The goal of this is to raise phonetic awareness in the user immediately prior to the task. While the suggested in-production training modifications are sound, the nature of a music-based task means that the prosody of the phonemes is determined not by an accent or linguistic convention, but by the rhythm of the notes of the music. Nevertheless, her use of arrows to show intonation could be adapted to give live pitch feedback to someone using the program. For post-production training, a record of any consistent errors could be collected in order to build a learner profile. For example, if a user is consistently singing a major sixth in either direction incorrectly, or is struggling with the /l/ and /r/ distinction, then the program can make a note of this, notify the user, and provide feedback relevant to their particular struggle so that they may improve.
6. In step 6, Pennington recommends linking the CAP program to the larger language as a whole. A good way of doing this would be to take advantage of the large number of English loan words in Japanese, which almost always have their pronunciations altered to match the syllabic nature of the Japanese phonetic system. Examples include:
English | Japanese | Hepburn transliteration | Applicable to
Alcohol | アルコール | arukōru | Major 2nd, 6th
Solar | ソーラー | sōrā | Perfect 5th
Ticket | チケット | chiketto | Major 7th
London | ロンドン | rondon | Major 6th
By linking the target phonetics to the larger language context, the user's phonetic awareness would be raised.
7. Point 7 stresses the importance of neither isolating a CAP program from an established curriculum nor relying on the program so heavily that it becomes the curriculum. Since the target audience of this program consists of students attending a university music program, it is safe to assume that they are already involved in an academic and social situation which is conducive to the same sets of skills that we are looking to improve. Therefore this program exists in an appropriate supplementary space.
8. See point 10.
9. Raising awareness between L1 and L2 differences is something which will be addressed below.
10. Point 10 deals with the concept of interactivity. The point to take from this is that the CAP program should be designed so that users can make the most efficient use of their time by choosing what they want to work on, rather than being forced through a set curriculum. A great example of an interactive ear training program comes from the website www.tonedear.com (accessed 12/11/2022).
Figure 2
Interval Ear Training Interface from www.tonedear.com
The user interface is exceptionally clear, and allows the user to choose exactly which intervals they wish to work on while also allowing them to focus on interval direction and to modify difficulty by selecting the tempo of the played intervals.   
Learners
	For the sake of this paper, I wish to limit the scope of my target audience to L1 Japanese speakers. When designing a program such as the one described by this paper, I believe it is good practice to develop language profiles for as many individual languages as possible.  The phonetic systems of each individual language will differ greatly from each other, and therefore demand individualized attention. By developing a phonetic profile for each language, we can train the program to be aware of where errors are most likely to occur, as well as educate the user about where key differences in the phonetic output between the two languages occur.  As discussed in class, relying on the concept of standardized language pronunciation is likely to result in the alienation of minority accents.  It is therefore necessary to avoid taking anything resembling a “one size fits all” (Rogerson-Revell, 2021) approach.   
In order to offer as detailed an account as possible, I will also be limiting the scope of this paper to the major scale. I believe one scale will offer plenty of material to work with while allowing enough space for a deeper analysis of phonetic differences and application functions. The basic premise will be applicable to any scale.
Returning to the idea of language profiles, many researchers have taken similar approaches when developing CALL programs for individual languages. Lambacher (2010) developed a CALL system for L1 Japanese learners of English and provided an in-depth phonetic analysis of key areas where the two languages differ and where learners often struggle. I will draw heavily from this analysis and will point out the phonetic differences most relevant to solfege for our learners.
Example 1: /ɺ/ and /l/
Perhaps the most difficult aspect of the phonetics for students to master is the distinction between these two sounds. The distinction is important in this context because of the syllables used for the major 2nd (re), minor 2nd (ra), major 6th (la), and minor 6th (le). The inability to clearly distinguish the /ɺ/ and /l/ sounds in a musical setting could, as noted above, have drastic consequences for the student.
Example 2: /t/ and /ʃ/ when combined with the vowel /i/
Japanese has a fairly strict consonant-vowel (CV) phonology, and its words are usually restricted to combinations of 15 consonants and 5 vowels (Shibatani, 1990, p. 137). One notable CV pattern missing from Japanese phonology is /ti/, which is replaced by /ʃi/ for the major 7th degree. As the sound used in English solfege is completely absent from Japanese phonology, this is an area which requires close attention.
Example 3: /f/ and /ɸ/
Another example of a phonetic area to focus on is the difference between the English /f/ and the Japanese /ɸ/ as it pertains to the perfect fourth of the solfege system.
Action and Technology
Sound discrimination tasks (SDTs) have been shown to be an effective way of homing in on language-specific phonetics for English speaking learners of German (Thomson, 2011). A similar system could be applied to a solfege program by having an SDT contrasting the L1 and L2 versions of each of the seven syllables of the major scale. Martin (2020) showed that the ability to distinguish between new L2 sounds, and the building of new perceptual categories, is important for being able to produce L2 sounds. This provides further evidence for including SDTs as a stepping stone activity to prepare the user for the main task.
The program will consist of two systems which judge two separate aspects of what the user produces. From the start, the user will design their task by selecting which intervals they wish to practice singing. By designing the criteria of their practice, users are able to tailor the program to their individual needs. They will then begin the task by pressing a start button, which plays an I-V-I cadence in a randomly selected key to establish the tonic note. From there, the program will ask the user to sing one of the preselected intervals.
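The task setup described above might be sketched in Python as follows. This is only an illustration: the names MAJOR_SCALE_SEMITONES, KEYS, and build_task are hypothetical, the spelling "so" for the fifth degree is an assumption, and playing the cadence audio is left out entirely.

```python
import random

# Semitone offsets above the tonic for each major-scale solfege syllable.
# The syllable spellings (e.g. "so" rather than "sol") are an assumption.
MAJOR_SCALE_SEMITONES = {
    "do": 0, "re": 2, "mi": 4, "fa": 5, "so": 7, "la": 9, "ti": 11,
}

# The twelve possible tonics from which a key is drawn at random.
KEYS = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]


def build_task(selected_syllables, rng=None):
    """Build one practice item from the syllables the user chose to work on."""
    rng = rng or random.Random()
    if not selected_syllables:
        raise ValueError("the user must select at least one syllable")
    unknown = [s for s in selected_syllables if s not in MAJOR_SCALE_SEMITONES]
    if unknown:
        raise ValueError(f"not a major-scale syllable: {unknown}")

    target = rng.choice(sorted(selected_syllables))
    return {
        "key": rng.choice(KEYS),        # randomly selected key
        "cadence": ["I", "V", "I"],     # establishes the tonic before singing
        "target_syllable": target,      # what the user is asked to sing
        "semitones_above_tonic": MAJOR_SCALE_SEMITONES[target],
    }


if __name__ == "__main__":
    print(build_task(["fa", "ti"]))
```

Passing a seeded random.Random instance as rng makes the selection repeatable, which may be useful when the same sequence of items needs to be replayed for a learner.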
The program then takes the input from the user and analyzes it in two ways. For the musical criterion, it will look at the pitch of the two notes sung in order to judge whether they are within an acceptable range. For example, if the program asks for a perfect fifth above an A, it will check whether the starting note falls at one of the frequencies assigned to an A (110 Hz, 220 Hz, 440 Hz) and is followed by one assigned to an E (approximately 164 Hz, 329 Hz, 659 Hz). If the sung pitches match the target frequencies and have the correct relationship to each other, then the program will show a correct notification for part 1 of the task. The second part of the program will analyze the phonetic output of what was sung in order to judge whether the user is producing the correct consonants and vowels.
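The pitch judgment for part 1 could be sketched as follows. This is an illustration under stated assumptions rather than a finished design: the function names are hypothetical, the sung frequencies are assumed to have already been extracted from the audio by a separate pitch tracker, equal temperament is assumed, and the default tolerance of 50 cents is an arbitrary placeholder for the user-adjustable threshold discussed in point 4.

```python
import math


def cents_between(sung_hz, target_hz):
    """Signed distance from target_hz to sung_hz in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def octave_folded_deviation(sung_hz, pitch_class_hz):
    """Deviation in cents from the nearest octave of pitch_class_hz.

    Folding into [-600, 600) lets 110 Hz, 220 Hz, and 440 Hz all count as an A.
    """
    cents = cents_between(sung_hz, pitch_class_hz)
    return (cents + 600.0) % 1200.0 - 600.0


def check_interval(first_hz, second_hz, tonic_hz, semitones, tolerance_cents=50.0):
    """Return True if the starting note and the sung interval are both in range.

    tolerance_cents is an assumed default; the user would be able to adjust it.
    """
    start_deviation = octave_folded_deviation(first_hz, tonic_hz)
    interval_deviation = cents_between(second_hz, first_hz) - 100.0 * semitones
    return (abs(start_deviation) <= tolerance_cents
            and abs(interval_deviation) <= tolerance_cents)


if __name__ == "__main__":
    # A perfect fifth (7 semitones) above A, sung as 220 Hz then 329.63 Hz.
    print(check_interval(220.0, 329.63, 440.0, 7))
    # The same starting note followed by an F (a minor sixth) instead of an E.
    print(check_interval(220.0, 349.23, 440.0, 7))
```

Measuring in cents rather than raw hertz keeps the tolerance musically constant across registers, so a low voice and a high voice are judged by the same standard.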
When users first begin the program, they will start with simple one interval tasks. However, in order to replicate the real world use of solfege singing, the program must be able to account for not only longer sequences of intervals, but also a range of rhythmic and tempo diversity in order to increase its authenticity.
As noted by Menzel et al. (2001), any pronunciation training system should provide sufficient feedback for the user to improve. Users need to know whether they are achieving the task. When they fail to do so, the program should explain why an incorrect judgment was rendered and supply strategies for improving in future iterations of the task. A template I came across when reviewing CAP programs comes from a paper by Olson (2014). When learners of Spanish failed to pronounce words correctly, they were given visual aids in the form of spectrograms along with reflective questions which aimed to increase their awareness of the differences between a model native Spanish speaker’s spectrogram and their own.
Figure 3A and 3B
Olson’s (2014) Varieties of Visual Feedback
Figure 4
Engwall’s (2011) Virtual Teacher Animation
In addition to reflective questions and spectrogram comparisons, another source of useful feedback for the user comes in the form of virtual teachers. Engwall (2011) took a more physiological approach to feedback by building a model of a human head which users could manipulate in order to analyze the different tongue movements involved in the sounds of Swedish. Massaro and Light (2003) also used a virtual teacher to show that Japanese learners of English benefited from visual feedback in developing an /r/ and /l/ distinction.
Tsubota et al. (2004) took a different approach to feedback by developing error models consisting of 79 error patterns for Japanese learners of English. Using a combination of ASR-based pronunciation tasks, the program would develop individualized learner profiles which highlighted consistent errors and offered feedback for further improvement. I believe adopting a learner profile built from the user’s common mistakes will go a long way toward addressing the issue identified by Derwing (2003), specifically that some ELLs thought their pronunciation was poor yet lacked the metalinguistic awareness to identify their mistakes.
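A learner profile of this kind could be sketched as a simple tally of substitutions. This is not the error model used by Tsubota et al.; it is a minimal illustration in which the class name, the min_count threshold, and the syllable labels are all hypothetical, and in which the ASR step is assumed to have already returned the syllable the user produced.

```python
from collections import Counter


class LearnerProfile:
    """Tally target/produced mismatches and report the ones that keep recurring."""

    def __init__(self, min_count=3):
        # min_count is an assumed threshold for calling an error "consistent".
        self.min_count = min_count
        self.attempts = Counter()
        self.errors = Counter()

    def record(self, target, produced):
        """Log one attempt; store the substitution only when it was wrong."""
        self.attempts[target] += 1
        if produced != target:
            self.errors[(target, produced)] += 1

    def consistent_errors(self):
        """Return (target, produced, count) tuples, most frequent first."""
        return [
            (target, produced, count)
            for (target, produced), count in self.errors.most_common()
            if count >= self.min_count
        ]


if __name__ == "__main__":
    profile = LearnerProfile()
    for _ in range(3):
        profile.record("ti", "shi")   # L1 syllable substituted for the English one
    profile.record("fa", "fa")        # correct attempt, not stored as an error
    profile.record("la", "ra")        # a single slip, below the threshold
    print(profile.consistent_errors())
```

Each reported pair could then be mapped to feedback specific to that substitution, such as the relevant SDT or articulation guidance.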
The virtual teacher approach could easily be adapted to fit the needs of the theoretical program described in this paper by building an animation database for each of the seven solfege syllables. Combined with the earlier mentioned reflective questions and spectrogram analysis, this would provide users of the system with three separate avenues of feedback to help them build phonetic competency.
The program described thus far has focused on training users to develop their skills in the context of solfege-based vocal production. However, music with lyrics contains a far more diverse range of phonemes than the seven-syllable solfege system. L2 singers need to be able to produce lyrical content clearly and accurately in order to avoid an unfortunate situation like the one described in the overview. When reviewing a few different singing training programs found on the internet, I came across the website https://www.tonegym.co/tool/item?id=sight-singing-trainer (accessed 12/11/2022), which provides a great template for addressing this situation.
Figure 5
Tonegym’s Visual Singing Trainer
Although this program only checks for pitch accuracy, I think that with the addition of ASR its format could be used to develop a more advanced version of the program described in this paper.
A final avenue worth exploring is the role of the IPA in raising phonetic awareness in users. As shown by Jensen (2003), using the IPA to train singers to sing in a language which they have never studied has been a staple of some contemporary conservatories for some time. Jensen argues that by using the IPA, singers circumvent the cognitive and semantic baggage of a word being a word, while also developing new neural pathways for pronunciation. Jensen puts forward interesting ideas, and I think the fundamental claim that the IPA can help ELL singers is sound. This is especially true in the case of raising awareness of phonetic differences between the L1 and L2, and it fits nicely with the SDT mentioned previously.
Theory
Figure 6
Deng et al. (2009): Percentage of Pronunciation Articles over 5 Years
Deng et al. (2009) claim that as a result of the importance placed on syntax and morphology in the classroom, pronunciation is relegated to a secondary concern. This seems to be reflected in academic research, as shown by another paper by Deng et al. (2009) which demonstrated how little the topic of pronunciation is covered in a range of SLA-related journals. My own experience as an ESL/EFL teacher, and the prevalence of focus-on-form curricula, generally align with Deng et al.’s findings. From these separate sources, I believe it wise to assume that any potential user of this program will have little to no formal training in English pronunciation.
Chapelle (2009) explored a number of different theories and their applicability to CALL programs. A theory which fits the use of SDTs is input processing theory. While the focus of the theory is on the act of noticing form and meaning, it seems to me that it can be adapted to include the noticing of pronunciation as well. If the acquisition and internalization of form is preceded by the noticing of form, then it is likely that the acquisition and internalization of pronunciation is likewise preceded by the noticing of phonetic characteristics. This is in line with the theoretical work done by Schmidt (1990) on the noticing hypothesis.
At the core of the program described in this article is a task for the user to complete. It follows that its theoretical grounding should be strongly based on current research on task-based language learning (TBLL). Luckily, East (2021) has recently published a detailed book on where the theory stands in its current form. One area which I found particularly relevant in his book was the work done by Ellis and Shintani (2014), where they tied TBLL to skill acquisition theory. They see the process of SLA as moving from a state of declarative knowledge, which relies heavily on premeditated cognitive processing, to a state of procedural knowledge, with an ultimate goal of automaticity. I see this theory as particularly relevant to the program described in this paper because of the dual nature of the skills being honed in it. Users will need to move, in both interval singing and English pronunciation at once, from a state of rather intense, deliberate practice, in which the qualities and color of both the music and the phonetics need to be noticed, ingrained, and produced, to a state in which they are finally internalized.
After a few decades of the use of CAP programs as learning tools, several authors have criticized developers’ habit of ignoring any need to tie these programs to a recognized theory of SLA. One of the most vocal critics of this phenomenon is Martha Pennington (2010), who takes aim at the theoretical shortcomings of previous programs from a socio-cultural perspective. The criticism which I believe to be most relevant to my own project concerns the hyper-segmentalization of phonetic output which most CAP programs feature. She correctly points out that a reliance on this kind of exercise ignores the more crucial suprasegmental phonology present in spoken language, which is more often than not the area where L2 learners struggle.
However, scholars in the field of vocal pedagogy don’t seem to be quite as ready to write off more segmented phonetic training. As noted by De’ath (2001), speech differs from melody in that “the meter and rhythm of melody prescribes a more or less definite spacing of the succession of phonetic events relative to one another as a text unfolds in singing. Things must happen at precise times, rather than floating freely in the time continuum, as is the privilege of the orator.”
To make a musical analogy, anyone can play a difficult passage note by note. It’s when you add in the musical equivalent to suprasegmental phonology (dynamics, intonation, rhythm, etc) that a passage becomes difficult. I have attempted to address these concerns by designing a more advanced version of the task where the users have moved away from solfege, and are instead graded on how well they can pronounce the lyrics to common songs while maintaining correct pitch.
Conclusion
As we have seen, pronunciation remains an ever-present yet understudied linguistic phenomenon that presents language learners with a myriad of challenges. While I believe this program could be a useful tool for ELL singers, it is not without its limitations. As sociolinguists would argue, we can’t remove the learner/singer from the social context which they inhabit. This is especially true in a field such as music, where questions of aesthetic taste often shape judgements about the relative desirability of different learner accents.
References
Akayoğlu, S. (2019). Theoretical frameworks used in CALL studies: A systematic review. Teaching English with Technology, 4, 104-118.
Bashori, M., van Hout, R., Strik, H., & Cucchiarini, C. (2022). ‘Look, I can speak correctly’: learning vocabulary and pronunciation through websites equipped with automatic speech recognition technology. Computer Assisted Language Learning, 1-29. https://doi.org/10.1080/09588221.2022.2080230
Chapelle, C. A. (2009). The Relationship Between Second Language Acquisition Theory and Computer-Assisted Language Learning. The Modern Language Journal, 93(Focus Issue), 14.
De'ath, L. (2001). Language and Diction The Merits and Perils of the IPA for Singing. The Official Journal of the National Association of Teachers of Singing, 57, 57-68.
Debevc, M., Weiss, J., Šorgo, A., & Kožuh, I. (2019). Solfeggio learning and the influence of a mobile application based on visual, auditory and tactile modalities. British Journal of Educational Technology, 51(1), 177-193. https://doi.org/10.1111/bjet.12792
Deng, J., Holtby, A., Howden-Weaver, L., Nessim, L., Nicholas, B., Nickle, K., Pannekoek, C., Stephan, S., & Sun, M. (2009). English pronunciation research: The neglected orphan of second language acquisition studies? Edmonton, AB: Prairie Metropolis Centre.
Derwing, T. (2003). What Do ESL Students Say About Their Accents? The Canadian Modern Language Review 59(4), 547-566.
Derwing, T. M., & Munro, M. J. (2005). Second Language Accent and Pronunciation Teaching: A Research-Based Approach. TESOL Quarterly, 39(3). https://doi.org/10.2307/3588486
East, M. (2021). Foundational Principles of Task-Based Language Teaching. Routledge.
Ellis, R., & Shintani, N. (2014). Exploring language pedagogy through second language acquisition research. Routledge.
Engwall, O. (2011). Analysis of and feedback on phonetic features in pronunciation training with a virtual teacher. Computer Assisted Language Learning, 25(1), 37-64. https://doi.org/10.1080/09588221.2011.582845
Engwall, O., & Bälter, O. (2007). Pronunciation feedback from real and virtual language teachers. Computer Assisted Language Learning, 20(3), 235-262. https://doi.org/10.1080/09588220701489507
Evers, K., & Chen, S. (2020). Effects of an automatic speech recognition system with peer feedback on pronunciation instruction for adults. Computer Assisted Language Learning, 1-21. https://doi.org/10.1080/09588221.2020.1839504
Friend, D., & Lumsden, M. (2008). The Phonemic Alphabet in English. ReCALL, 8(1), 47-49. https://doi.org/10.1017/s095834400000344x
Godwin-Jones, R. (2009). Emerging technologies: Speech tools and technologies. Language Learning & Technology, 13(3). http://llt.msu.edu/vol13num3/emerging.pdf
Holmes, B. (2010). Initial Perceptions of CALL by Japanese University Students. Computer Assisted Language Learning, 11(4), 397-409. https://doi.org/10.1076/call.11.4.397.5674
Jensen, K. (2003). Teaching the ESL Singer. The Official Journal of the National Association of Teachers of Singing 59(5), 415-419.
Deng, J., Holtby, A., Howden-Weaver, L., Nessim, L., Nicholas, B., Nickle, K., Pannekoek, C., Stephan, S., & Sun, M. (2009). English pronunciation research: The neglected orphan of second language acquisition studies? PCERII Working Paper Series WP05-09.
Lambacher, S. (2010). A CALL Tool for Improving Second Language Acquisition of English Consonants by Japanese Learners. Computer Assisted Language Learning, 12(2), 137-156. https://doi.org/10.1076/call.12.2.137.5722
Łodzikowski, K. (2021). Association between allophonic transcription tool use and phonological awareness level. Language Learning & Technology, 25(1), 20-30.
Martin, I. (2020). Pronunciation development and instruction in distance language learning. Language Learning & Technology, 24(1), 86-106.
Menzel, W., Herron, D., Morton, R., Pezzotta, D., Bonaventura, P., & Howarth, P. (2001). Interactive pronunciation training. ReCALL, 13(1), 67-78. https://doi.org/10.1017/s0958344001000714
Olson, D. J. (2014). Benefits of visual feedback on segmental production in the L2 classroom. Language Learning & Technology.
Pennington, M. C. (2010). Computer-Aided Pronunciation Pedagogy: Promise, Limitations, Directions. Computer Assisted Language Learning, 12(5), 427-440. https://doi.org/10.1076/call.12.5.427.5693
Pennington, M. C. (2019). ‘Top-Down’ Pronunciation Teaching Revisited. RELC Journal, 50(3), 371-385. https://doi.org/10.1177/0033688219892096
Rogerson-Revell, P. M. (2021). Computer-Assisted Pronunciation Training (CAPT): Current Issues and Future Directions. RELC Journal, 52(1), 189-205. https://doi.org/10.1177/0033688220977406
Saito, K. (2007). The influence of explicit phonetic instruction on pronunciation teaching in EFL settings: The case of English vowels and Japanese learners of English. The Linguistics Journal, 3(3), 16–40.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(1), 17–46.
Shibatani, M. (1990). The Major Languages of East and South-East Asia (B. Comrie, Ed.). Routledge.
Thomson, R. I. (2012). Improving L2 listeners’ perception of English vowels: A computer-mediated approach. Language Learning, 62(4), 1231–1258.
Thomson, R. I. (2011). Computer assisted pronunciation training: Targeting second language vowel perception improves pronunciation. CALICO Journal, 28(3), 744–765.
Tsai, P.-h. (2019). Beyond self-directed computer-assisted pronunciation learning: a qualitative investigation of a collaborative approach. Computer Assisted Language Learning, 32(7), 713-744. https://doi.org/10.1080/09588221.2019.1614069
Tsubota, Y., Dantsuji, M., & Kawahara, T. (2004). An English pronunciation learning system for Japanese students based on diagnosis of critical pronunciation errors. ReCALL, 16(1), 173-188. https://doi.org/10.1017/s0958344004001314
