Category Archives: acoustics

Seiji ADACHI & Masashi YAMADA : An acoustical study of sound production in biphonic singing, Xöömij

Abstract A theory that the high melody pitch of biphonic singing, Xöömij, is produced by the pipe resonance of the rear cavity in the vocal tract is proposed. The front cavity resonance is not critical to the production of the melody pitch. This theory is derived from acoustic investigations on several three-dimensional shapes of a Xöömij singer’s vocal tract measured by magnetic resonance imaging. Four different shapes of the vocal tract are examined, with which the melody pitches of F6, G6, A6, and C7 are sung, along with the F3 drone of a specific pressed voice. The second formant frequency calculated from each tract shape is close to the melody pitch within an error of 36 cents. Sounds are synthesized by convolving a glottal source waveform provided by the Rosenberg model with transfer functions calculated from the vocal tract shapes. Two pitches are found to be successfully perceived when the synthesized sounds are listened to. In a frequency range below 2 kHz, their spectra have a strong resemblance to those of the sounds actually sung. The synthesized sounds, however, fail to replicate the harmonic clustering at 4–5 kHz observed in the actual sounds. This is speculated to originate from the glottal source specific to the “pressed” timbre of the drone.
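
The pitch arithmetic behind the abstract can be sketched in a few lines. The example below is only an illustration, not the authors' method: it assumes equal temperament with A4 = 440 Hz (the abstract does not state a tuning reference), and the helper names `note_to_hz`, `cents` and `nearest_harmonic` are hypothetical. It shows how an interval is expressed in cents and which harmonic of the F3 drone lies nearest to each melody pitch. It does not reproduce the formant calculation from the MRI-derived tract shapes.

```python
import math

A4_HZ = 440.0  # assumption: equal temperament tuned to A4 = 440 Hz
SEMITONES_FROM_C = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(name: str) -> float:
    """Convert a natural note name such as 'F3' to a frequency in Hz."""
    letter, octave = name[0], int(name[1:])
    midi = 12 * (octave + 1) + SEMITONES_FROM_C[letter]  # A4 maps to 69
    return A4_HZ * 2.0 ** ((midi - 69) / 12)


def cents(f_hz: float, ref_hz: float) -> float:
    """Signed interval from ref_hz to f_hz in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def nearest_harmonic(pitch_hz: float, drone_hz: float) -> int:
    """Harmonic number of the drone that lies closest to pitch_hz."""
    return round(pitch_hz / drone_hz)


drone_hz = note_to_hz("F3")
results = []
for melody in ("F6", "G6", "A6", "C7"):
    target_hz = note_to_hz(melody)
    n = nearest_harmonic(target_hz, drone_hz)
    # Deviation of the n-th drone harmonic from the tempered melody note
    results.append((melody, n, cents(n * drone_hz, target_hz)))

if __name__ == "__main__":
    for melody, n, deviation in results:
        print(f"{melody}: harmonic {n} of F3, {deviation:+.1f} cents")
```

Under these assumptions the four melody pitches fall closest to harmonics 8, 9, 10 and 12 of the F3 drone, each well inside the 36-cent margin quoted for the second formant.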

REFERENCES

  1. 1. B. Chernov and V. Maslov, “Larynx—Double-sound generator,” Proc. 11th Int. Conf. of Phonetic Science, Tallinn, Estonia, pp. 40–43 (1987). Google Scholar
  2. 2. Q. H. Trân and D. Guillou, “Original research and acoustical analysis in connection with the Xöömij style of biphonic singing,” in Musical Voices of Asia (Heibonsha, Tokyo, 1980), pp. 162–173. Google Scholar
  3. 3. T. Muraoka, K. Wagatsuma, and M. Horiuchi, “Acoustic analysis of the Mongolian singing Xöömij,” Proc. Fall Meet. Acoust. Soc. Jpn., pp. 385–386 (1983) (in Japanese). Google Scholar
  4. 4. S. Adachi, S. Kinoshita, H. Tamagawa, and M. Yamada, “MRI measurement of the vocal-tract shape while singing Xöömij and the synthesis based on the acoustic tube model,” Tech. Rep. Musical Acoustics MA96-10, 9–16 (1996) (in Japanese). Google Scholar
  5. 5. S. Adachi, S. Kinoshita, T. Komoike, H. Tamagawa, and M. Yamada, “Study on sound production in Xöömij—Part 1: MRI measurement of the vocal-tract shape and the synthesis based on the acoustic tube model,” Proc. Spring Meet. Acoust. Soc. Jpn., pp. 645–646 (1996) (in Japanese). Google Scholar
  6. 6. T. Komoike, S. Kinoshita, M. Yamada, S. Adachi, and I. Nakayama, “Study on sound production in Xöömij—Part 2: Perceptual experiment with synthesized sound,” Proc. Spring Meet. Acoust. Soc. Jpn., pp. 647–648 (1996) (in Japanese). Google Scholar
  7. 7. S. Adachi and M. Yamada, “An acoustical study of sound production in biphonic singing, Xöömij,” Proc. 1997 Japan-China Joint Meeting on Musical Acoustics, pp. 21–26 (1997). Google Scholar
  8. 8. B. H. Story, I. R. Titze, and E. A. Hoffman, “Vocal tract area functions from magnetic resonance imaging,” J. Acoust. Soc. Am. 100, 537–554 (1996). Google ScholarScitation, CAS
  9. 9. J. Dang, K. Honda, and H. Suzuki, “Morphological and acoustical analysis of the nasal and the paranasal cavities,” J. Acoust. Soc. Am. 96, 2088–2100 (1994). , Google ScholarScitation, CAS
  10. 10. R. Caussé, J. Kergomard, and X. Lurton, “Input impedance of brass musical instruments—Comparison between experiments and numerical models,” J. Acoust. Soc. Am. 75, 241–254 (1984). , Google ScholarScitation
  11. 11. M. M. Sondhiand J. Schroeter, “A hybrid time-frequency domain articulatory speech synthesizer,” IEEE Trans. Acoust., Speech, Signal Process. ASSP-35, 955–967 (1987). , Google ScholarCrossref
  12. 12. A. E. Rosenberg, “Effect of glottal pulse shape on the quality of natural vowels,” J. Acoust. Soc. Am. 49, 583–590 (1971). , Google ScholarScitation
  13. 13. The synthesized tones can be heard on the World Wide Web at http://www.hip.atr.co.jp/∼adachi/Xoomij/Sound/. , Google Scholar
  14. 14. J. Dangand K. Honda, “Acoustic characteristics of the piriform fossa in models and humans,” J. Acoust. Soc. Am. 101, 456–465 (1997). Google ScholarScitation, CAS
  15. 15. D. G. Childersand C. K. Lee, “Vocal quality factors: Analysis, synthesis, and perception,” J. Acoust. Soc. Am. 90, 2394–2410 (1991). , Google ScholarScitation, CAS
  16. 16. K. Ishizakaand J. L. Flanagan, “Synthesis of voiced sounds from a two-mass model of the vocal cords,” Bell Syst. Tech. J. 51, 1233–1268 (1972). , Google ScholarCrossref
  17. 17. M. Yamada, “Stream segregation in Mongolian traditional singing, Xöömij,” Proc. Int. Sym. Musical Acoustics, Dourdan, pp. 540–545 (1995). , Google Scholar
  18. 18. J. L. Flanagan, Speech Analysis, Synthesis and Perception, 2nd ed. (Springer-Verlag, New York, 1972), Chap. 3, pp. 36–38. Google Scholar
© 1999 Acoustical Society of America.

The Journal of the Acoustical Society of America 105, 2920 (1999); https://doi.org/10.1121/1.426905

https://asa.scitation.org/doi/10.1121/1.426905

Marie-Cécile BARRAS & Anne-Marie GOUIFFES : The Reception of Overtone Singing by Uninformed Listeners

journal of interdisciplinary music studies
spring/fall 2008, volume 2, issue 1&2, art. #0821204, pp. 59-70
•Correspondence: M.-C. Barras, Univ. Bordeaux IV (IUFM), 160 Avenue de Verdun – BP 90152, 33705
Mérignac Cedex, France; e-mail: marie-cecile.barras@aquitaine.iufm.fr

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.460.541&rep=rep1&type=pdf

The Reception of Overtone Singing by Uninformed
Listeners


Marie-Cécile Barras (1) and Anne-Marie Gouiffès (2)
(1) University of Bordeaux (IUFM d’Aquitaine, Bordeaux IV and Department of Music,
Bordeaux III)
(2) Jeannine Manuel Bilingual School, Paris and OMF, University of Paris IV-Sorbonne
Background in acoustics and psychoacoustics. Overtone singing is a vocal technique by
which a single source produces two melodic pitches simultaneously. When an unprepared
listener hears a recording of overtone singing, the first question is usually: “How was the sound
produced?” The level of auditory education may play a role in the perception of this
phenomenon.
Background in cultural studies. A study of a cultural initiation includes subjective aspects of
reception. Listeners are presented with an unknown vocal technique from a popular culture. The
majority of listeners will experience it as a real cultural confrontation with an unknown world
of sound. The initial phase of acculturation is therefore the most salient.
Aims. Our objective is to determine how the listener reacts to an unknown musical
phenomenon, in both its perceptive and cultural dimensions.
Main contribution. The present study concerns the listener’s reception of overtone singing.
The musical corpus includes styles of singing as diverse as those found in Tuva and Mongolia,
in South Africa (Xhosa women), or among the Dani people of Irian Jaya in the Indonesian
territory of New Guinea. Psychoacoustic tests were given to 338 adolescents (10–15 years old).
Implications. In order to understand the mechanism ‘from the inside’, all the listeners
who took the tests then tried overtone singing themselves. Our study opens the door to a
transformation in the way one listens. It encourages an openness to other artistic and cultural
dimensions through a real education of the ear.

Keywords: Overtone singing, ethnomusicology, pitch perception, psychoacoustics, reception, cultural
studies













Introduction
The present study concerns the listener’s reception of overtone singing, a vocal
technique by which a single source produces two melodic pitches simultaneously.
One of the pitches is the generally stable fundamental, which serves as a sort of drone;
the other results from the shifting emphasis of different harmonics. This shifting
emphasis has a melodic intent. The various harmonics are obtained through a
modification (by pronouncing the vowels) of the singer’s resonators (the pharyngo-
buccal cavity acts as a resonator of variable volume) and a particular use of the breath
with a forceful contraction of the abdominal and neck muscles. This is characteristic
of traditional techniques of overtone singing. By contrast, the
ethnomusicologist Trân Quang Hai suggests that the singer’s nose can be used as a
natural shutter to filter out undesirable frequencies with no more
effort than speaking requires (Zemp, 1989; Zemp & Trân, 1991), but this experimental
possibility should not be confused with guttural overtone singing (or ‘throat
singing’, as Mongols and Tuvans call it).
This highly original vocal technique has been observed, recorded and studied
in depth for about thirty years by ethnomusicologists and acousticians (Léothaud,
1989). A short interactive analysis sequence can be found
at: http://www.mae.u-paris10.fr/crem-cnrs/Animations/diphonique/hai1.html, or at the
website of the ‘Musée de l’Homme’ in Paris: http://www.ethnomus.org [1] (click
‘Enter’, choose ‘Réalisations Multimedia’, ‘Les clefs d’écoute’, ‘Le chant
diphonique’). The frequencies can be visualized in real time as the
singer performs, thanks to dedicated software
(e.g. ‘sygyt software’, http://www.sygyt.com [2]; Maass & Saus, 2003) which draws
sonagrams (or spectrograms), thus illustrating the acoustic specificity of the sound.
Though this vocal technique is now well known by specialists (Walcott, 1974; Trân,
1975, 2002) and also used in contemporary music, overtone singing is still unknown
to the general public. The first listening experiment that we did in music classes
triggered varied and surprising reactions. This music raised issues linked both to
perception and cultural background. Therefore we felt that overtone singing was a
good basis to explore the psychoacoustic experience and the cultural initiation that it
entails.
So the main question, in both its perceptive and cultural dimensions, was: “How does an
uninformed listener react to an unknown musical phenomenon?”

Method
Musical corpus
The existence of overtone singing is now acknowledged in places beyond Mount
Altai, in Central Asia, among the following populations: Mongolians, Tuvans, Khakash,
Altaians, and the Bashkirs (west of the Urals). What is less well known is that
certain Xhosa women from South Africa perform overtone singing in extremely low
vocal registers and that a recording (by John Levy, 1967, cited in Zemp & Trân, 1991,
and in Trân, 2002) of a singer also shows that overtone singing existed in Rajasthan.
The Dani people of Irian Jaya in the Indonesian territory of New Guinea practise a
unique kind of triphonic singing that has yet to be fully researched.
We wanted to let the students listen to several techniques in order to explore the
diversity present in different continents and allow them to discover this phenomenon
in other new places. Thus the musical corpus includes styles of singing as diverse as
those found in Tuva and Mongolia (Desjacques, 1993), in South Africa (with
women’s voices singing in the lower register), or among the Dani people of Irian
Jaya.
The performance is executed by a sole vocalist, and not by a vocal ensemble which
produces a harmonic from the fusion of several voices, as observed in the polyphonic
singing of Sardinia (Lortat-Jacob, 1998) or with the deep bass voices of the monks of
the Gyütö monastery in Tibet, in exile in India (Trân, 1999).
Students listened to four extracts that were each less than one minute long:
• EX1 (= Extract 1), 0’57, Mongolia, xhöömij (Maison des Cultures du
monde, 1989, N°6: ‘Khuren khalgaatai delguur’).
Khoomei (or xhöömij or xöömii…, transliterated in different ways by different
authors), from a Mongolian word that means ‘throat’ or ‘pharynx’, is generally
translated as ‘throat-singing’ [3] and is the name of a particular soft-sounding style (with
clear harmonics) as well as the general term for throat-singing. The fundamental pitch
in the khoomei style is higher (in a baritone register) than that of kargiraa (see EX3).
• EX2 (= Extract 2), 0’34, New Guinea, Dani people (Petrequin, 2001,
N°18–19–20: ‘Lolo-Lou, Habema’).
Extract 2 (Dani people) is an ethnological document and not an
‘ethnomusicological’ recording. It is important to note that this extract was an exception
to the rule, because the singer was producing a triphonic sound.
• EX3 (= Extract 3), 0’56, Tuva, ‘Dag kargyraa’ (Zemp, 1996, N°37:
Russie–Kyzyl, République de Tuva).
The kargiraa style, from an onomatopoeic Tuvan word that means ‘to wheeze’ or ‘to
speak in a hoarse or husky voice’ (Alekseev, Kirgiz & Levin, 1990), is characterized
by an extremely low fundamental pitch (between 55 and 65 hertz [Hz]).

• EX4 (= Extract 4), 0’52, South Africa, Xhosa woman (Ngqoko, Lumko
district) (Zemp, 1996, N°36-a: Nondel’ ekhaya [‘Married at home’]; in the
style umngqokolo ngomqangi by Nowayilethi Mbizweni).
Umngqokolo (overtone singing in South Africa) ngomqangi style is ‘inspired by the
buzzing of a beetle held in front of the mouth [of the performer], with selection of
harmonics in the buccal cavity’ (Trân, in Zemp, 1996).
This heterogeneous musical corpus could make the identification of this phenomenon
more complicated for the listeners (because of the stylistic diversity of the pieces), but
we wanted to offer a pedagogical and cultural mind opener that was as wide as
possible in our test.
Note that overtone singing is traditionally performed by men. The ethnomusicologist
Trân Quang Hai thinks [4] this tradition is based on acoustical principles (Zemp & Trân,
1991). One has to produce a fundamental tone of about 150 Hz (or
about 200 Hz at most) to perceive harmonics up to the 13th harmonic (a vibration of
about 2000 Hz for a 150 Hz fundamental, or about 2600 Hz for a 200 Hz one), but a
150 Hz emission is not within the usual
register of women’s voices. For instance, in the test, the Tuvan melody in ‘Kargiraa
technique’ (EX3) uses harmonics number 8, 9, 10 and 12.
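
The figures quoted above follow from the fact that the n-th harmonic sits at n times the fundamental. The short sketch below works through that arithmetic; the function name `harmonic_hz` is hypothetical, and applying the 55–65 Hz range to EX3 is an assumption taken from the general description of the kargiraa style, not a measurement of that extract.

```python
def harmonic_hz(fundamental_hz: float, n: int) -> float:
    """Frequency of the n-th harmonic (n = 1 is the fundamental itself)."""
    if n < 1:
        raise ValueError("harmonic number must be >= 1")
    return fundamental_hz * n


# 13th harmonic for the two fundamentals quoted in the text
upper_150 = harmonic_hz(150.0, 13)  # 1950 Hz, i.e. "about 2000 Hz"
upper_200 = harmonic_hz(200.0, 13)  # 2600 Hz

# Melody harmonics named for EX3, over the 55-65 Hz range given for kargiraa
# (assumption: the EX3 fundamental lies somewhere in that range)
KARGIRAA_HARMONICS = (8, 9, 10, 12)
kargiraa_ranges = {
    n: (harmonic_hz(55.0, n), harmonic_hz(65.0, n)) for n in KARGIRAA_HARMONICS
}

if __name__ == "__main__":
    print(f"13th harmonic: {upper_150:.0f} Hz and {upper_200:.0f} Hz")
    for n, (low, high) in kargiraa_ranges.items():
        print(f"harmonic {n}: {low:.0f} to {high:.0f} Hz")
```

With such a low fundamental, even the 12th harmonic stays below 800 Hz, which is why a kargiraa melody sounds far lower than a khoomei melody built on the same harmonic numbers.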
The Xhosa woman in South Africa (when using the lower register of her voice; EX4),
is an exception, and all the pupils thought this was a man’s voice in this extract.
Psychoacoustic test: subjects, listening experiment and questionnaire
Psychoacoustic tests were given to a broad public (adolescents and young adults
between the ages of 10 and 30, assumed to have normal hearing). The results in
this paper are based on the answers of 338 adolescents between the ages of 10 and 15.
These adolescents attend an international school, the “École Active Bilingue Jeannine
Manuel”, which offers bilingual education in French and English (about 75% of the students use
at least two languages at home and practise bilingualism daily). The audience
included all students from 14 classes taught by Anne-Marie Gouiffès.
The test took fifty minutes (including setting up and instructions), which corresponds to a
full weekly music lesson. The group of adolescents listened together to the four
extracts in the music class. They were asked not to communicate with one another or to
show any visible reaction (but this proved impossible for them; see ‘Results / Students’
reactions’).

Figure 1. Ages with the distribution of girls and boys.
An open questionnaire was distributed at the beginning of the session, followed by
more precise questions that led to an investigation of the structure of the sound.
Listeners were asked four questions (a different question for each extract). It is
important to note that they did not have the questionnaire in advance, but read the
questions one by one as the test proceeded (the students were allowed only
three minutes to write down their answers).
The questions are the following (EX1= Extract 1, EX2= Extract 2, etc.):
• EX1 – What are you hearing?
• EX2 – What are you listening to? Try to explain more. Try to describe the
sound.
• EX3 – In addition to what you have heard, try to say what you have felt and
try to explain this feeling.
• EX4 – What was your personal feeling? Try to explain the reasons why and
give us as many details as you can. Try to guess how the sound could be
made.
Results
When unprepared listeners hear a recording of overtone singing, one of their first
questions is usually: “How was this sound produced?” After a closer listening, they
may try to decipher the nature of the sound (the level of auditory education may play
a role in the perception of this phenomenon).
[Figure 1: bar chart of the number of boys and girls (vertical axis, 0 to 140) in each age class from 10 to 15 (horizontal axis).]

Students’ reactions
Children of this age are still spontaneous enough to physically manifest what they are
feeling through body language. When they finally understand that the sound was produced by
a human voice, many adolescents, contrary to older audiences, are tempted to
reproduce the sound themselves, much as they did when first learning to speak. So,
reactions were varied: perplexity, laughter, miming, curiosity, pleasure, rejection,
enthusiasm, fear… [5]
Written answers
Responses for each extract in the open questionnaire, together with comments about the sound,
allowed several listening channels to be distinguished. The categories of the origin of
the sound in Figure 2 are drawn from the free comments. The distribution of
the types of answers shows: source(s) of the sound(s) determined; voice(s) and/or
instrument(s) or labelled sound (without voice); perception of the source(s)
undetermined (see Figure 2).

In the channel noted ‘undetermined source of sound’, most of the answers specified
that they were unable to determine the source. So the remaining question was: “How
was this sound produced?”

Figure 2. Distribution of the types of answers per extract.

It was, at times, difficult to differentiate between reactions to musical characteristics
and those related to specific vocal techniques; however, we noticed that the negative
reactions (like ‘it gives me a headache’) seemed to be related to the feeling of the
obtrusive and awkward presence of the singer’s body (‘It’s like somebody with a sore
throat, and we feel it ourself’ – EX3, Justine, 13 years old), while positive reactions
were the expression of musical appreciation (like ‘It was beautiful, the way he sings’
– EX3, Joséphine, 13 years old). This point should be confirmed by data.
The answers were so interesting that other studies should be made related to the
vocabulary of specific notions that we had not foreseen (sense of the Sacred,
exoticism, feeling of strangeness, emotions, presence of the body, etc).
One example of answers concerning ‘Exoticism’
As a way to locate some unknown music (in a foreign country, from a civilization, a
cultural practice, a ritual or a ceremony…) or as an expression of ‘aesthetics of
diversity’

About 25% of the pupils spontaneously felt the need to locate this ‘sound’ in a country, or
identify it as originating from some civilization, some cultural practice, some ritual or
ceremony, etc. (we noticed that this need grew in proportion to the age of the students
from 11 to 15). Identification was often erroneous. Only one student referred to
Mongolia for Extract 1 (correctly) but he also referred to Mongolia for Extract 4
(South Africa, in fact). We can see that around 12% referred to Africa correctly for
EX4, and only 1% for the other extracts.
But, initially, getting a right or wrong answer was not the issue. These answers
were interesting because they allowed us to see how the students pictured
‘Difference’ and ‘Distance’.
The ‘furthest away’ in space is an ‘alien’, or an ‘unidentified flying object’. The
‘furthest away’ in time is a man ‘imitating a Prehistoric man’. Lea, 14 years old,
wrote: ‘It is as if I was in a cavern during the Prehistoric Age, before man could
speak’. One pupil imagined ‘Asian monks in Antiquity’; such an image condenses
distant space, distant time and the spiritual sphere into one. The answers cover a
map of the whole globe, with a preference for Africa, Asia and Australia (imagined
as a continent peopled with Aborigines only). We can note that Europe and
America (except for the Indians!) and, in fact, Western cultures generally are not
mentioned (or only exceptionally and in a different way: use of technology with electronically
altered voices). In their view, this is not exoticism (not all of these adolescents are
from Europe, but they would refer to Western culture). Yet, ‘a far-away place where
technology has yet to arrive’ (Aude, 14 years old), can be exotic…
All of these answers (a significant number of them from youngsters aged 14–15) lay
the foundation of the concept of ‘the aesthetics of diversity’, according to the words used
by Victor Segalen in his essay, Essai sur l’exotisme (1999).

We can suppose that the adolescents’ need to locate and culturally identify four
minutes of music (an unknown musical phenomenon) may have to do with their
growing individuality and self-awareness. These adolescents can assert their identity by
applying diversity to the Antipodes, the natural wilderness of a country, a temple in a
forest, tribal dances or rituals… The singer could be a Shaman, an Indian or an African
witch-doctor… They can hear ‘the national anthem of the jungle’, but they honestly
confess imagining things: ‘At the end, I interpreted it as a song sung for African
celebrations, even though I knew it was not so’. With ‘A human voice which does not
come from our cultures’ (Jianne, 13 years old), which ‘transports […] out of class’
(Julia, 14 years old), one leaves the secular world. Perhaps because of the strangeness
of the sonority, this voice was often associated with sacred customs. As a religion,
Buddhism was frequently mentioned (due, maybe, to what the Western world has learned
of Buddhism and its mystery).
We found other answers without any connection to a particular religion, but very
profound and tinged with mysticism: ‘I felt enveloped by the sound; it struck down
deep into my bones; it dissolved all the little noises in the background and became the
ONLY sound. It was so loud and overwhelming that I seemed to become a lonely rock
shaken by that moan. It was like a call to God that pierced the sky and shattered the
trivial things that surrounding me. It was everything in the world while it played….’
(Emma, 12 years old).
Background
At the end of the test, students were asked personal questions because we wanted to
know which kind of connection could be established between their way of hearing
and their personal background.
The questions relating to the background are the following:
• Do you play an instrument? If yes, which one and for how long?
• What is your mother tongue? Are there any other languages spoken at home
(except French)?
• Have you ever lived abroad or spent significant amounts of time abroad on
holiday (out of France)?
• Have you heard this kind of music before? Give us some details.
• What are your musical tastes?
The data about the background are still under analysis; however, a brief overview can
be provided. For example, we noticed that several students mentioned the
‘didgeridoo’ as an instrument that could accompany the singer, or into which the singer
could sing while blowing through it. The second point is most interesting even if it is
not factually correct: it shows that listeners have perceived, consciously or not, the
existence of harmonics.
• EX1 – Shalla-Marie, 13 years old:
‘A kind of highpitched voice. Vibrating. And we can hear the breaths he (or she) is
taking. It sounds like he is moving his mouth a lot to make the sound. It is a peculiar
sound, very peculiar and it sounds like the kind of music the aborigines in Australia
listen to and make.’

• EX1 – Bastien, 13, lived in Australia for 3 ½ years:
‘This sounds like the didgeridoo, the instrument of the Australian natives.’
• EX1 – Hilary, 11, Australian pupil:
‘I hear a man’s voice imitating the sound of a didgeridoo. Maybe a recorder.’
That is why we had to verify whether those students had had special musical experience
when they lived or stayed abroad, or whether they had specific knowledge linked to their
family origins. All the Australian pupils (or those who had had a long stay in
Australia) said the sound was reminiscent of the Aboriginal instrument.
Implications
In order to understand the mechanism ‘from the inside’, all the students of the 14
classes of the bilingual school “École Active Bilingue Jeannine Manuel” who had
taken the tests tried overtone singing (along the lines of Zemp & Trân, 1989, 1997;
Trân & Souvet, 2004; Trân, 2005). Two weeks after the experiment, more than three
hundred pupils were taught the basics of the technique of overtone singing by
Anne-Marie Gouiffès, as she herself had been taught by Trân Quang Hai in 2004.
Reactions were varied: girls were generally shy and inhibited, but when they tried,
the results were often excellent. Almost all the boys considered this vocal experience
a challenge and were very excited to perform.
Practising
We conducted our investigation in a bilingual school where multiculturalism was a
reality. In such an environment, signs of rejection or refusal on a cultural basis — if
they existed — were less obvious, less overt. One would not reject music because it is
strange (or ‘alien’) but resistance is based on the assumption that it is some ‘noise’,
some ‘sound’ that takes you aback, or is boring or unpleasant.

Practising the basics of overtone singing is a good way of discovering ‘alterity’ — or
‘otherness’ — because the feeling of voicing and listening implies a complete change
of habit. At the same time students discover new physical sensations and listen to
their own voices in a way they never did before. It sometimes reveals a new
personality, and the students who succeeded in overtone singing were admired by
their friends. Besides, we soon noticed that some students, usually shy when it came
to singing, were very happy to practice overtone singing. Probably, these students had
wished to learn overtone singing as it is sung in Mongolia or Tuva.
Despite their breaking voices, all the boys in the classes were willing to try and were interested
in the performance of others. ‘Sense of failure’ posed no problem. One student asked
whether written scores of overtone singing existed; this led to a discussion about oral
and written tradition… The matter was different with the girls. They were interested in
the description of the phenomenon but the majority of them refused to try overtone
singing. [6] It may have to do with the fear that this way of singing, which causes funny
faces and transforms voices, might alter their own image.
Conclusion
After investigating a musical corpus and vocal technique little known to the general
public, we intended to bring a musicological contribution to the work of acousticians,
ethnomusicologists, and, obviously, musicians who practice overtone singing (Zemp
& Trân, 1989, 1997; Pegg, 2001).
This study proposed to examine overtone singing through the unique perspective of
listeners’ reception. The majority of listeners (a middle school audience) experienced
it as a real cultural confrontation with an unknown world of sound. The initial phase
of acculturation is therefore the most salient.
This type of research requires an interdisciplinary approach (musicology;
ethnomusicology; acoustics and its musical subdisciplines, and psychoacoustics; the
psychology of perception; sociology and cultural studies). After having conducted this
experiment, we are able to conclude that the discovery of overtone singing opened the
door to a transformation in the way one listens. It encourages opening up to other
artistic and cultural dimensions through a real education of the ear.
Acknowledgments
First of all, Anne-Marie Gouiffès would like to express her gratitude to Trân Quang
Hai, a master and a friend who initiated her into the world of overtone singing and
accompanied her day by day in the fascinating universe of overtones. Marie-Cécile
Barras would like to warmly thank Michèle Castellengo, who supported her for many
years in many scientific matters.

Anne-Marie Gouiffès also would like to particularly thank her friend and colleague
Dominique Ayné who helped her to edit a documentary on the living experiment.
Marie-Cécile Barras and Anne-Marie Gouiffès would also like to mention Myriam
Faurite, Henri Barras, Jean Civray and Matthew Thomas.

References
Alekseev, E., Kirgiz, Z. & Levin, T., (recordings and notes by) (1990). Voices from the center
of Asia. Smithsonian/Folkways SF 40017 [+CD audio].
Desjacques, A. (1993). Chants de l’Altai Mongol. Thèse de musicologie. Paris: Université de
Paris IV-Sorbonne.
Léothaud, G. (1989). Considérations acoustiques et musicales sur le chant diphonique. Le chant
diphonique, Dossier n° 1 (pp. 17-43). Limoges: Institut de la Voix.
Lortat-Jacob, B. (1998). Chants de Passions. Au cœur d’une confrérie de Sardaigne. Paris:
Editions du Cerf.
Maison des Cultures du monde (published by) (1989). Mongolie. Musique vocale et
instrumentale. Inédit W 260009 [CD audio].
Pegg, C. (2001). Mongolian music, dance and oral narrative: Performing diverse identities.
Seattle: University of Washington Press.
Pétrequin, P. & Weller, O. (recordings and notes by) (2001). Polyphonies de l’âge de pierre.
Les Dani de Nouvelle-Guinée, Volume II. Nord-Sud musique [+CD audio].
Segalen, V. (1999/1955). Essai sur l’exotisme. Paris: Le livre de poche.
Trân, Q. H. (1975). Technique de la voix chantée mongole: xöömij. Bulletin du CEMO
(14&15): 32-36. Paris.
_________. (1999). Overtones used in Tibetan Buddhist Chanting and in Tuvin Shamanism. In:
R. Astrauskas (ed.), Ritual and Music (pp. 129-136). Vilnius: Lithuanian Academy of
Music, Department of Ethnomusicology.
_________. (2002). À la découverte du chant diphonique. In: G. Cornut (éd.), Moyens
d’investigation et pédagogie de la voix chantée. Actes de colloque (pp.117-132). Lyon:
Symétrie.
_________. (2005). Recherches introspectives sur le chant diphonique et leurs applications
[Introspective research on overtone singing, and its application]. Penser la voix, 41. Rennes:
Presses universitaires de Rennes (online publication).
Trân, Q. H. & Souvet, L. (2004). Le chant diphonique. CRDP de la Réunion: SCÉREN (DVD)
[multimedia-documentary].
Walcott, R. (1974). The Chöömij of Mongolia. A spectral analysis of overtone singing. Selected
Reports in Ethnomusicology 2(1): 55-59. Los Angeles: UCLA.
Zemp, H. & Trân, Q. H. (1989). Le chant des harmoniques. CERIMES. [multimedia-
documentary].
_________. (1991). Recherches expérimentales sur le chant diphonique. Cahiers de Musiques
traditionnelles: Voix, 4: 27-68. Genève: Ateliers d’ethnomusicologie / AIMP.
_________. (1997). Le chant des harmoniques. Paris: CNRS Audiovisuel, Laboratoire d’Études
d’Ethnomusicologie et SFE. (VHS SECAM). [multimedia-documentary].
Zemp, H. (coordination), Léothaud, G., & Lortat-Jacob, B. (1996). Les voix du monde, une
anthologie des expressions vocales / Voices of the world, an anthology of vocal expression.
Collection “Musée de l’Homme, CNRS”. Le chant du Monde MCX 374 1010.12 [+ 3CD
audio].

[1] Official website of the UMR laboratory of the CNRS 7186 (Centre National de la Recherche Scientifique,
France).
[2] Created by Bodo Maass and Wolfgang Saus in 2003, this website ‘can be used as a spectrum analyzer and
a visual feedback tool as well as an interactive visualization of music theory’.
[3] See the website http://www.khoomei.com, created by Steve Sklar (USA), for Tuvan throat singing or Khoomei.
[4] As he told us in a work session.
[5] The reactions were recorded on video and presented at the 3rd Conference on Interdisciplinary
Musicology (CIM07), held in Tallinn, Estonia, 15–19 August 2007, on the theme of singing.
[6] On the video documentary presented at CIM07, some girls can be seen trying overtone singing but these
instances were the only ones.

ANDY MURPHY: Ph.D. thesis: An Investigation of Overtone Singing for Electroacoustic Composition using FOF Synthesis, 2013, Trinity College, Dublin

An Investigation of Overtone Singing for Electroacoustic Composition using FOF Synthesis

  • January 2013

DOI:10.13140/RG.2.2.23717.32489

  • Thesis for: MPhil in Music and Media Technologies
  • Advisor: Dermot Furlong

Authors: Andy Murphy

Music & Media Technologies
School of Engineering
&
School of Drama, Film and Music
Trinity College Dublin

Submitted as part fulfilment for the degree of M.Phil.
2013

https://www.researchgate.net/publication/346470023_An_Investigation_of_Overtone_Singing_for_Electroacoustic_Composition_using_FOF_Synthesis

Dr ANDY MURPHY

MALTE KOB : SINGING VOICE ACOUSTICS PROJECT

Singing Voice Acoustics

Project log

MALTE KOB

Mar 12, 2019: Malte Kob added a research item

  • 53: Future Perspectives (Chapter, Apr 2019)

Jan 8, 2018: Malte Kob added a research item

  • Synthèse de la voix chantée (Chapter, Apr 2014)

Sep 13, 2017: Malte Kob added 40 project references

  • Physikalische Modellierung der menschlichen Stimme (Conference Paper, Mar 2003, DAGA 2003 Bochum)
  • Stimme und Raum, Raum in der Stimme: Wechselwirkungen (Chapter, Jan 2011, in Aspekte des Singens I: Voraussetzungen, Klangparameter, Ausdrucksformen)
  • A system for parallel measurement of glottis opening and larynx position (Article, Jul 2009)

Aug 17, 2017: Malte Kob added 2 research items

  • Intonation im Chor (Article, Nov 2015)
  • Die Singstimme (Chapter, Oct 2014)

https://www.researchgate.net/project/Singing-Voice-Acoustics

Christopher Bergevin, Chandan Narayan, Joy Williams, Natasha Mhatre, Jennifer KE Steeves, Joshua GW Bernstein, Brad Story : Overtone focusing in biphonic tuvan throat singing

https://elifesciences.org/articles/50476

Overtone focusing in biphonic tuvan throat singing

Authors:

  1. Christopher Bergevin (corresponding author),
  2. Chandan Narayan,
  3. Joy Williams,
  4. Natasha Mhatre,
  5. Jennifer KE Steeves,
  6. Joshua GW Bernstein,
  7. Brad Story (corresponding author)

Affiliations:

  1. Physics and Astronomy, York University, Canada;
  2. Centre for Vision Research, York University, Canada;
  3. Fields Institute for Research in Mathematical Sciences, Canada;
  4. Kavli Institute of Theoretical Physics, University of California, United States;
  5. Languages, Literatures and Linguistics, York University, Canada;
  6. York MRI Facility, York University, Canada;
  7. Biology, Western University, Canada;
  8. Psychology, York University, Canada;
  9. National Military Audiology & Speech Pathology Center, Walter Reed National Military Medical Center, United States

Research Article Feb 12, 2020

Abstract

Khoomei is a unique singing style originating from the republic of Tuva in central Asia. Singers produce two pitches simultaneously: a booming low-frequency rumble alongside a hovering high-pitched whistle-like tone. The biomechanics of this biphonation are not well-understood. Here, we use sound analysis, dynamic magnetic resonance imaging, and vocal tract modeling to demonstrate how biphonation is achieved by modulating vocal tract morphology. Tuvan singers show remarkable control in shaping their vocal tract to narrowly focus the harmonics (or overtones) emanating from their vocal cords. The biphonic sound is a combination of the fundamental pitch and a focused filter state, which is at the higher pitch (1–2 kHz) and formed by merging two formants, thereby greatly enhancing sound-production in a very narrow frequency range. Most importantly, we demonstrate that this biphonation is a phenomenon arising from linear filtering rather than from a nonlinear source.

Gerrit Bloothooft, Eldrid Bringmann, Marieke van Cappellen, Jolanda B. van Luipen, and Koen P. Thomassen : Acoustics and perception of overtone singing

Submitted: 29 March 1991 Accepted: 04 June 1992 Published Online: 04 June 1998

Acoustics and perception of overtone singing

The Journal of the Acoustical Society of America 92, 1827 (1992); https://doi.org/10.1121/1.403839

Gerrit Bloothooft, Eldrid Bringmann, Marieke van Cappellen, Jolanda B. van Luipen, and Koen P. Thomassen

ABSTRACT

Overtone singing, a technique of Asian origin, is a special type of voice production resulting in a very pronounced, high and separate tone that can be heard over a more or less constant drone. An acoustic analysis is presented of the phenomenon and the results are described in terms of the classical theory of speech production. The overtone sound may be interpreted as the result of an interaction of closely spaced formants. For the lower overtones, these may be the first and second formant, separated from the lower harmonics by a nasal pole‐zero pair, as the result of a nasalized articulation shifting from /ɔ/ to /a/, or, as an alternative, the second formant alone, separated from the first formant by the nasal pole‐zero pair, again as the result of a nasalized articulation around /ɔ/. For overtones with a frequency higher than 800 Hz, the overtone sound can be explained as a combination of the second and third formant as the result of a careful, retroflex, and rounded articulation from /ɔ/, via schwa /ə/ to /y/ and /i/ for the highest overtones. The results indicate a firm and relatively long closure of the glottis during overtone phonation. The corresponding short open duration of the glottis introduces a glottal formant that may enhance the amplitude of the intended overtone. Perception experiments showed that listeners categorized the overtone sounds differently from normally sung vowels, which possibly has its basis in an independent perception of the small bandwidth of the resonance underlying the overtone. Their verbal judgments were in agreement with the presented phonetic‐acoustic explanation.

  1. © 1992 Acoustical Society of America.
  2. https://asa.scitation.org/doi/10.1121/1.403839

Christopher Bergevin, Chandan Narayan, Joy Williams, Natasha Mhatre, Jennifer KE Steeves, Joshua GW Bernstein, Brad Story : Overtone focusing in biphonic tuvan throat singing

Overtone focusing in biphonic tuvan throat singing

Research Article Feb 12, 2020

Cite as: eLife 2020;9:e50476 doi: 10.7554/eLife.50476

eLife digest

The republic of Tuva, a remote territory in southern Russia located on the border with Mongolia, is perhaps best known for its vast mountainous geography and the unique cultural practice of “throat singing”. These singers simultaneously create two different pitches: a low-pitched drone, along with a hovering whistle above it. This practice has deep cultural roots and has now been shared more broadly via world music performances and the 1999 documentary Genghis Blues.

Despite many scientists being fascinated by throat singing, it was unclear precisely how throat singers could create two unique pitches. Singing and speaking in general involves making sounds by vibrating the vocal cords found deep in the throat, and then shaping those sounds with the tongue, teeth and lips as they move up the vocal tract and out of the body. Previous studies using static images taken with magnetic resonance imaging (MRI) suggested how Tuvan singers might produce the two pitches, but a mechanistic understanding of throat singing was far from complete.

Now, Bergevin et al. have better pinpointed how throat singers can produce their unique sound. The analysis involved high quality audio recordings of three Tuvan singers and dynamic MRI recordings of the movements of one of those singers. The images showed changes in the singer’s vocal tract as they sang inside an MRI scanner, providing key information needed to create a computer model of the process.

This approach revealed that Tuvan singers can create two pitches simultaneously by forming precise constrictions in their vocal tract. One key constriction occurs when the tip of the tongue nearly touches a ridge on the roof of the mouth, and a second constriction is formed by the base of the tongue. The computer model helped explain that these two constrictions produce the distinctive sounds of throat singing by selectively amplifying a narrow set of high frequency notes that are made by the vocal cords. Together these discoveries show how very small, targeted movements of the tongue can produce distinctive sounds.

Introduction

In the years preceding his death, Richard Feynman had been attempting to visit the small republic of Tuva, located in the geographic center of Asia (Leighton, 2000). A key catalyst came from Kip Thorne, who had gifted him a record called Melody tuvy, featuring a Tuvan singing in a style known as Khoomei, or Xöömij. Although he was never successful in visiting Tuva, Feynman was nonetheless captivated by Khoomei, which can be best described as a high-pitched tone, similar to a whistle carrying a melody, hovering above a constant booming low-frequency rumble. This is a form of biphonation, or in Feynman’s own words, “a man with two voices”. Khoomei, now a part of the UNESCO Intangible Cultural Heritage of Humanity, is characterized as “the simultaneous performance by one singer of a held pitch in the lower register and a melody … in the higher register” (Aksenov, 1973). How, indeed, does one singer produce two pitches at one time? Even today, the biophysical underpinnings of this biphonic human vocal style are not fully understood.

Normally, when a singer voices a song or speech, their vocal folds vibrate at a fundamental frequency (f0), generating oscillating airflow, forming the so-called source. This vibration is not, however, simply sinusoidal, as it also produces a series of harmonic tones (i.e., integer multiples of f0) (Figure 1). Harmonic frequencies in this sound above f0 are called overtones. Upon emanating from the vocal folds, they are then sculpted by the vocal tract, which acts as a spectral filter. The vocal-tract filter has multiple resonances that accentuate certain clusters of overtones, creating formants. When speaking, we change the shape of our vocal tract to shift formants in systematic ways characteristic of vowel and consonant sounds. Indeed, singing largely uses vowel-like sounds (Story, 2016). In most singing, the listener perceives only a single pitch associated with the f0 of the vocal production, with the formant resonances determining the timbre. Khoomei has two strongly emphasized pitches: a low-pitch drone associated with the f0, plus a melody carried by variation in the higher frequency formant that can change independently (Kob, 2004). Two possible loci for this biphonic property are the source and/or the filter.

Figure 1

Frequency spectra for three different singers transitioning from normal to biphonic singing.

Vertical white lines in the spectrograms (left column) indicate the time point for the associated spectrum in the right column. Transition points from normal to biphonic singing state are denoted by …

A source-based explanation could involve different mechanisms, such as two vibrating nonlinear sound sources in the syrinx of birds, which produce multiple notes that are harmonically unrelated (Fee et al., 1998; Zollinger et al., 2008). Humans, however, are generally considered to have only a single source, the vocal folds. But there are alternative possibilities: for instance, the source could be nonlinear and produce harmonically-unrelated sounds. For example, aerodynamic instabilities are known to produce biphonation (Mahrt et al., 2016). Further, Khoomei often involves dramatic and sudden transitions from simple tonal singing to biphonation (see Figure 1 and the Appendix for associated audio samples). Such abrupt changes are often considered hallmarks of physiological nonlinearity (Goldberger et al., 2002), and vocal production can generally be nonlinear in nature (Herzel and Reuter, 1996; Mergell and Herzel, 1997; Fitch et al., 2002; Suthers et al., 2006). Therefore it remains possible that biphonation arises from nonlinear source considerations.

Vocal tract shaping, a filter-based framework, provides an alternative explanation for biphonation. In one seminal study of Tuvan throat singing, Levin and Edgerton examined a wide variety of song types and suggested that there were three components at play. The first two (‘tuning a harmonic’ relative to the filter and lengthening the closed phase of the vocal fold vibration) represented a coupling between source and filter. But it was the third, narrowing of the formant, that appeared crucial. Yet, the authors offered little empirical justification for how these effects are produced by the vocal tract shape in the presented radiographs. Thus it remains unclear how the high-pitched formant in Khoomei was formed (Grawunder, 2009). Another study (Adachi and Yamada, 1999) examined a throat singer using magnetic resonance imaging (MRI) and captured static images of the vocal tract shape during singing. These images were then used in a computational model to produce synthesized song. Adachi and Yamada argued that a “rear cavity” was formed in the vocal tract and its resonance was essential to biphonation. However, their MRI data reveal limited detail since they were static images of singers already in the biphonation state. Small variations in vocal tract geometry can have pronounced effects on produced song (Story et al., 1996) and data from static MRI would reveal little about how and which parts of the vocal tract change shape as the singers transition from simple tonal song to biphonation. To understand which features of vocal tract morphology are crucial to biphonation, a dynamic description of vocal tract morphology would be required.

Here we study the dynamic changes in the vocal tracts of multiple expert practitioners from Tuva as they produce Khoomei. We use MRI to acquire the volumetric 3D shape of the vocal tract of a singer during biphonation. Then, we capture the dynamic changes in a midsagittal slice of the vocal tract as singers transition from tonal to biphonic singing while making simultaneous audio recordings of the song. We use these empirical data to guide our use of a computational model, which allows us to gain insight into which features of vocal tract morphology are responsible for the singing phonetics observed during biphonic Khoomei song (e.g., Story, 2016). We focus specifically on the Sygyt (or Sigit) style of Khoomei (Aksenov, 1973).

Results

Audio recordings

We made measurements from three Tuvan singers performing Khoomei in the Sygyt style (designated as T1–T3) and one (T4) in a non-Sygyt style. Songs were analyzed using short-time Fourier transforms (STFT), which provide detailed information in both temporal and spectral domains. We recorded the singers transitioning from normal singing into biphonation, Figure 1 showing this transition for three singers. The f0 of their song is marked in the figure (approximately 140 Hz for subject T2, 164 Hz for both T1 and T3) and the overtone structure appears as horizontal bands. Varying degrees of vibrato can be observed, dependent upon the singer (Figure 1; see also longer spectrograms in Appendix 1—figure 6 and Appendix 1—figure 7). Most of the energy in their song is concentrated in the overtones and no subharmonics (i.e., peaks at half-integer multiples of f0) are observed. In contrast to these three singers, singer T4 performing in a non-Sygyt style exhibited a fundamental frequency of approximately 130 Hz, although significant energy additionally appears around 50–55 Hz, well below an expected subharmonic (Appendix 1—figure 5).
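As a rough illustration of the STFT analysis described above, the following sketch computes magnitude spectra frame by frame for a synthetic signal and reports the strongest spectral peak before and after a crude "focusing" of one overtone. Everything here is an assumption made for illustration: the sample rate, the 125 Hz drone (chosen so that harmonics fall exactly on DFT bins), the frame and hop sizes, the Hann window, and the toy signal itself. It is not the analysis pipeline used in the study.

```python
import cmath
import math

FS = 8000            # sample rate in Hz (assumed)
F0 = 125.0           # toy drone fundamental in Hz (assumed; lands on a DFT bin)
FRAME, HOP = 256, 128

def stft_magnitudes(signal, frame_len, hop):
    """Return one magnitude spectrum (bins 0..frame_len/2) per Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    # Precompute the complex exponentials used by the direct DFT.
    twiddle = [cmath.exp(-2j * math.pi * m / frame_len) for m in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(seg[n] * twiddle[(k * n) % frame_len] for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

def toy_voice(n_samples, switch_at):
    """Harmonic series with 1/k amplitudes; after switch_at, harmonic 12 is boosted."""
    out = []
    for n in range(n_samples):
        t = n / FS
        x = 0.0
        for k in range(1, 25):
            amp = 1.0 / k
            if n >= switch_at and k == 12:
                amp = 3.0  # crude stand-in for a focused overtone at 1.5 kHz
            x += amp * math.sin(2.0 * math.pi * k * F0 * t)
        out.append(x)
    return out

def peak_frequency(mags):
    """Frequency (Hz) of the largest-magnitude bin in one frame."""
    k = max(range(len(mags)), key=mags.__getitem__)
    return k * FS / FRAME

signal = toy_voice(2048, switch_at=1024)
frames = stft_magnitudes(signal, FRAME, HOP)
first_peak = peak_frequency(frames[0])    # frame taken before the switch
last_peak = peak_frequency(frames[-1])    # frame taken after the switch
print(first_peak, last_peak)
```

In this toy case the dominant peak moves from the fundamental to the boosted overtone, which is the kind of change a spectrogram slice makes visible. A real analysis would use an FFT library and longer frames for finer frequency resolution.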

If we take a slice, that is a time-point from the spectrogram and plot the spectrum, we can observe the peaks to infer the formant structure from this representation of the sound (red-dashed lines in Figure 1 and Appendix 1—figure 4). As the singers transition from normal singing to biphonation, we see that the formant structure changes significantly and the positions of formant peaks shift dramatically and rapidly. Note that considering time points before and after the transitions also provides an internal control for both normal and focused song types (Appendix 1—figure 4). Once in the biphonation mode, all three singers demonstrate overtones in a narrow spectral band around 1.5–2 kHz; we refer to this as the focused state. Specifically, Figure 1 shows that not only is just a single or small group of overtones accentuated, but also that nearby ones are greatly attenuated: ±1 overtones are as much as 15–35 dB and ±2 overtones are 35–65 dB below the central overtone. Whereas the energy in the low-frequency region associated with the first formant (below 500 Hz) is roughly constant between the normal-singing and focused states, there is a dramatic change in the spectrum for the higher formants above 500 Hz. In normal singing (i.e., prior to the focused state), spectral energy is distributed across several formants between 500 and 4000 Hz. In the focused state after the transition, the energy above 500 Hz becomes narrowly focused in the 1.5–2 kHz region, generating a whistle-like pitch that carries the song melody.

To assess the degree of focus objectively and quantitatively, we computed an energy ratio eR(fL, fH) that characterizes the relative degree of energy brought into a narrow band against the energy spread over the full spectrum occupied by human speech (see Materials and methods). In normal speech and singing, for [fL, fH] = [1, 2] kHz, eR is typically small (i.e., energy is spread across the spectrum, not focused into that narrow region between 1 and 2 kHz). For the Tuvan singers, prior to a transition into a focused state, eR(1,2) is similarly small. However, once the transition occurs (red triangle in Figure 1), those values are large (upwards of 0.5 and higher) and sustained across time (Appendix 1—figure 2 and Appendix 1—figure 3). For one of the singers (T2) the situation was more complex, as he created multiple focused formants (Figure 1 middle panels and Appendix 1—figure 6, Appendix 1—figure 8). The second focused state was not explicitly dependent upon the first: The first focused state clearly moves and transitions between approximately 1.5–2 kHz (by 30%) while the second focused state remains constant at approximately 3–3.5 kHz (changing less than 1%). Thus the focused states are not harmonically related. Unlike the other singers, T2 not only has a second focused state, but also had more energy in the higher overtones (Figure 1). As such, singer T2 also exhibited a different eR time course, which took on values that could be relatively large even prior to the transition. This may be because he approached the transition into a focused state in multiple ways (e.g., Appendix 1—figure 9).
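A minimal sketch of such an energy ratio is given below. The exact definition lives in the Materials and methods, which are not reproduced here, so the form used (energy in the band divided by energy from 0 Hz to an upper limit `f_max`) and the 8 kHz upper limit are assumptions. The toy spectra (a roughly 1/k² power roll-off and a single overtone boosted by 30 dB) are likewise hypothetical and serve only to show how the ratio separates a spread spectrum from a focused one.

```python
import math

def energy_ratio(freqs_hz, power, f_lo, f_hi, f_max=8000.0):
    """Fraction of spectral energy in [f_lo, f_hi] relative to [0, f_max].

    freqs_hz and power are equal-length sequences of bin frequencies and
    power values (|X(f)|^2). f_max = 8 kHz is an assumed speech-band limit.
    """
    band = sum(p for f, p in zip(freqs_hz, power) if f_lo <= f <= f_hi)
    total = sum(p for f, p in zip(freqs_hz, power) if 0.0 <= f <= f_max)
    return band / total if total > 0 else 0.0

def harmonic_spectrum(f0, levels_db):
    """Toy line spectrum: harmonic k of f0 has level levels_db[k - 1] in dB."""
    freqs = [f0 * (k + 1) for k in range(len(levels_db))]
    power = [10.0 ** (level / 10.0) for level in levels_db]
    return freqs, power

f0 = 164.0   # fundamental reported for singers T1 and T3
n = 40       # number of harmonics in the toy spectrum

# Toy "normal" voice: level falls 6 dB per octave (hypothetical roll-off).
normal_db = [-6.0 * math.log2(k) for k in range(1, n + 1)]

# Toy "focused" voice: same roll-off, but harmonic 10 (1640 Hz) raised by 30 dB.
focused_db = list(normal_db)
focused_db[9] += 30.0

e_normal = energy_ratio(*harmonic_spectrum(f0, normal_db), 1000.0, 2000.0)
e_focused = energy_ratio(*harmonic_spectrum(f0, focused_db), 1000.0, 2000.0)
print(round(e_normal, 3), round(e_focused, 3))
```

With these toy numbers the ratio is small for the spread spectrum and well above 0.5 once one overtone in the 1–2 kHz band dominates, mirroring the qualitative behavior described in the text.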

Plotting spectra around the transition from normal to biphonation singing in a waterfall plot indicates that the sharp focused filter is achieved by merging two broader formants together (F2 and F3 in Figure 2; Kob, 2004). This transition into the focused state is fast (∼40–60 ms), as are the shorter transitions within the focused state where the singer melodically changes the filter that forms the whistle-like component of their song (Figure 1, Appendix 1—figure 8).

Figure 2

A waterfall plot representing the spectra at different time points as singer T2 transitions from normal singing into biphonation (T2_3short.wav).

The superimposed arrows are color-coded to help visualize how the formants change about the transition, chiefly with F3 shifting to merge with F2. This plot also indicates the second focused state …

Vocal tract MRI

While we can infer the shape of the formants in Khoomei by examining audio recordings, such analysis is not conclusive in explaining the mechanism used to achieve these formants. The working hypothesis was that vocal tract shape determines these formants. Therefore, it was crucial to examine the shape and dynamics of the vocal tract to determine whether the acoustic measurements are consistent with this hypothesis. To accomplish this, we obtained MRI data from one of the singers (T2) that are unique in two regards. First, there are two types of MRI data reported here: steady-state volumetric data (Figure 3 and Appendix 1—figure 18) and dynamic midsagittal images at several frames per second that capture changes in vocal tract position (Figure 4A–B and Appendix 1—figure 20). Second is that the dynamic data allow us to examine vocal tract changes as song transitions into a focused state (e.g., Appendix 1—figure 20).

Figure 3

3-D reconstruction of volumetric MRI data taken from singer T2 (Run3; see Appendix, including Appendix 1—figure 18).

(A) Example of MRI data sliced through three different planes, including a pseudo-3D plot. Airspaces were determined manually (green areas behind tongue tip, red for beyond). Basic labels are …

Figure 4

Analysis of vocal tract configuration during singing.

(A) 2D measurement of tract shape. The inner and outer profiles were manually traced, whereas the centerline (white dots) was found with an iterative bisection technique. The distance from the inner …

The human vocal tract begins at the vocal folds and ends at the lips. Airflow produced by the vocal cords sets the air-column in the tract into vibration, and its acoustics determine the sound that emanates from the mouth. The vocal tract is effectively a tube-like cavity whose shape can be altered by several articulators: the jaw, lips, tongue, velum, epiglottis, larynx and trachea (Figure 4C). Producing speech or song requires that the shape of the vocal tract, and hence its acoustics, are precisely controlled (Story, 2016).

Several salient aspects of the vocal tract during the production of Khoomei can be observed in the volumetric MRI data. The most important feature, however, is that there are two distinct and relevant constrictions when in the focused state, corresponding roughly to the uvula and alveolar ridge. Additionally, the vocal tract is expanded in the region just anterior to the alveolar ridge (Figure 4A). The retroflex position of the tongue tip and blade produces a constriction at 14 cm, and also results in the opening of this sublingual space. It is the degree of constriction at these two locations that is hypothesized to be the primary mechanism for creating and controlling the frequency at which the formant is focused.

Modeling

Having established that the shape of the vocal tract during Khoomei does indeed have two constrictions, consistent with observations from other groups, the primary goals of our modeling efforts were to use the dynamic MRI data as morphological benchmarks and capture the merging of formants to create the focused states as well as the dynamic transitions into them. Our approach was to use a well-established linear “source/filter” model (e.g., Stevens, 2000) that includes known energy losses (Sondhi and Schroeter, 1987; Story et al., 2000; Story, 2013). Here, the vibrating vocal folds act as the broadband sound source (with the f0 and associated overtone cascade), while resonances of the vocal tract, considered as a series of 1-D concatenated tubes of variable uniform radius, act as a primary filter. We begin with a first-order assumption that the system behaves linearly, which allows for a simple multiplicative relationship between the source and filter in the spectral domain (e.g., Appendix 1—figure 10).
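The idea of a concatenated-tube filter multiplied by a harmonic source can be sketched in a few lines. This is a deliberately simplified illustration, not the model used in the study: it is lossless (the study's model includes energy losses), it assumes an ideal open end at the lips (zero pressure) and a volume-velocity source at the glottis, and the speed of sound, air density, section length, and the −12 dB/octave source roll-off are all assumed values. The function names are hypothetical.

```python
import math

C = 35000.0     # speed of sound in cm/s (assumed)
RHO = 0.00114   # air density in g/cm^3 (assumed)

def chain_matrix(area_cm2, length_cm, freq_hz):
    """Lossless chain (ABCD) matrix of one uniform tube section, as (A, B, C, D)."""
    k = 2.0 * math.pi * freq_hz / C
    z0 = RHO * C / area_cm2
    c, s = math.cos(k * length_cm), math.sin(k * length_cm)
    return (c, 1j * z0 * s, 1j * s / z0, c)

def matmul(m, n):
    """Product of two 2x2 matrices stored as flat 4-tuples."""
    a, b, c, d = m
    e, f, g, h = n
    return (a * e + b * g, a * f + b * h, c * e + d * g, c * f + d * h)

def tract_gain(areas_cm2, section_len_cm, freq_hz):
    """|U_lips / U_glottis| for sections ordered glottis to lips, with P = 0 at the lips."""
    total = (1.0, 0.0, 0.0, 1.0)
    for area in areas_cm2:
        total = matmul(total, chain_matrix(area, section_len_cm, freq_hz))
    # With zero pressure at the lips, U_glottis = D * U_lips, so the gain is 1/|D|.
    return 1.0 / max(abs(total[3]), 1e-9)

def formants(areas_cm2, section_len_cm, f_lo=50, f_hi=4000, step=5):
    """Frequencies (Hz) of local maxima of the tract gain on a coarse grid."""
    freqs = list(range(f_lo, f_hi + 1, step))
    g = [tract_gain(areas_cm2, section_len_cm, f) for f in freqs]
    return [freqs[i] for i in range(1, len(g) - 1) if g[i - 1] < g[i] > g[i + 1]]

def output_levels_db(f0, n_harm, areas_cm2, section_len_cm):
    """Source times filter at each harmonic; multiplication becomes addition in dB."""
    levels = []
    for k in range(1, n_harm + 1):
        source_db = -12.0 * math.log2(k)   # assumed source roll-off
        filter_db = 20.0 * math.log10(tract_gain(areas_cm2, section_len_cm, k * f0))
        levels.append((k * f0, source_db + filter_db))
    return levels

# A uniform 17.5 cm tube (35 sections of 0.5 cm) behaves as a quarter-wave resonator.
uniform = [3.0] * 35
print(formants(uniform, 0.5))
levels = output_levels_db(164.0, 20, uniform, 0.5)
```

Replacing the uniform area list with a measured area function is where the MRI data would enter; adding wall and radiation losses would give finite formant bandwidths instead of the idealized infinitely sharp peaks of this lossless sketch.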

Acoustic characteristics of the vocal tract can be captured by transforming the three-dimensional configuration (Figure 3) into a tube with variation in its cross-sectional area from the glottis to the lips (Figure 4 and Figure 5). This representation of the vocal tract shape is called an area function, and allows for calculation of the corresponding frequency response function (from which the formant frequencies can be determined) with a one-dimensional wave propagation algorithm. Although the area function can be obtained directly from a 3D vocal tract reconstruction (e.g., Story et al., 1996), the 3D reconstructions of the Tuvan singer’s vocal tract were affected by a large shadow from a dental post (e.g., see Figure 4) and were not amenable to detailed measurements of cross-sectional area. Instead, a cross-sectional area function was measured from the midsagittal slice of the 3D image set (see Materials and methods and Appendix for details). Thus, the MRI data provided crucial bounds for model parameters: the locations of primary constrictions and thereby the associated area functions.

Figure 5

Results of changing vocal tract morphology in the model by perturbing the baseline area function A0(x) to demonstrate the merging of formants F2 and F3, atop two separate overtones as apparent in the two columns of panels A and B.

(A) The frames from dynamic MRI with red and blue dashed circles highlighting the location of the key vocal tract constrictions. (B) Model-based vocal tract shapes stemming from the MRI data, …

The frequency response functions derived from the above static volumetric MRI data (e.g., Figure 4D) indicate that two formants F2 and F3 cluster together, thus enhancing both their amplitudes. Clearly, if F2 and F3 could be driven closer together in frequency, they would merge and form a single formant with unusually high amplitude. We hypothesize that this mechanism could be useful for effectively amplifying a specific overtone, such that it becomes a prominent acoustic feature in the sound produced by a singer, specifically the high frequency component of Khoomei.
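Why merging two formants raises the peak can be checked with a small numerical sketch. It assumes each formant is a standard second-order resonance and that the two act in cascade, so their gains multiply. The 100 Hz bandwidth, the fixed F2 of 1500 Hz, and the particular F3 values are hypothetical choices for illustration, not values from the study.

```python
import math

def resonance_gain(f, fc, bw):
    """Gain of a second-order resonance (unity gain at DC) at frequency f in Hz."""
    pole = complex(-math.pi * bw, 2.0 * math.pi * fc)
    s = complex(0.0, 2.0 * math.pi * f)
    return abs(pole) ** 2 / (abs(s - pole) * abs(s - pole.conjugate()))

def peak_gain_db(f2, f3, bw=100.0, f_lo=500, f_hi=4000):
    """Largest cascade gain of the two resonances on a 1 Hz grid, in dB."""
    best = max(resonance_gain(f, f2, bw) * resonance_gain(f, f3, bw)
               for f in range(f_lo, f_hi + 1))
    return 20.0 * math.log10(best)

# Hold F2 at 1500 Hz and bring F3 progressively closer (hypothetical values).
for f3 in (2500.0, 2000.0, 1600.0):
    print(f3, round(peak_gain_db(1500.0, f3), 1))
```

In this toy cascade the peak gain grows steadily as the spacing shrinks, because each resonance increasingly boosts the frequency region where the other already peaks.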

Next, we used the model in conjunction with time-resolved MRI data to investigate how the degree of constriction and expansion at different locations along the vocal tract axis could be a mechanism for controlling the transition from normal to overtone singing and the pitch while in the focused state. These results are summarized in Figure 5 (further details are in the Appendix). While the singers are in the normal song mode, there are no obvious strong constrictions in their vocal tracts (e.g., Appendix 1—figure 11). After they transition, in each MRI from the focused state, we observe a strong constriction near the alveolar ridge. We also observe a constriction near the uvula in the upper pharynx, but the degree of constriction here varies. If we examine the simultaneous audio recordings, we find that variations in this constriction are co-variant with the frequency of the focused formant. From this, we surmise that the mechanism for controlling the enhancement of voice harmonics is the degree of constriction near the alveolar ridge in the oral cavity (labeled CO in Figure 5), which affects the proximity of F2 and F3 to each other (Appendix 1—figure 12). Additionally, the degree of constriction near the uvula in the upper pharynx (CP) controls the actual frequency at which F2 and F3 converge (Appendix 1—figure 13). Other parts of the vocal tract, specifically the expansion anterior to CO, may also contribute since they also show small co-variations with the focused formant frequency (Appendix 1—figure 14). Further, a dynamic implementation of the model, as shown in Appendix 1—figure 14, reasonably captures the rapid transition into/out of the focused state as shown in Figure 1. Taken together, the model confirms and explains how these articulatory changes give rise to the observed acoustic effects.
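Because the focused formant enhances voice harmonics, which sit at integer multiples of f0, the melody pitches available within a given band follow from simple arithmetic. The sketch below lists the harmonics that fall in the 1.5–2 kHz band for the fundamentals reported above (164 Hz and 140 Hz) and the interval in cents between neighboring harmonics. The band edges are taken from the text; the helper names are hypothetical.

```python
import math

def harmonics_in_band(f0, f_lo, f_hi):
    """Harmonic numbers k such that f_lo <= k * f0 <= f_hi."""
    k_min = math.ceil(f_lo / f0)
    k_max = math.floor(f_hi / f0)
    return list(range(k_min, k_max + 1))

def cents(f_a, f_b):
    """Musical interval from f_a up to f_b, in cents."""
    return 1200.0 * math.log2(f_b / f_a)

for f0 in (164.0, 140.0):
    ks = harmonics_in_band(f0, 1500.0, 2000.0)
    steps = [round(cents(a * f0, b * f0)) for a, b in zip(ks, ks[1:])]
    print(f0, ks, [k * f0 for k in ks], steps)
```

Only a handful of harmonics fit in the band, and the steps between them shrink as the harmonic number rises, so the selectable melody notes are constrained by the drone pitch.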

To summarize, an overtone singer could potentially ‘play’ (i.e., select) various harmonics of the voice source by first generating a tight constriction in the oral cavity near the alveolar ridge, and then modulating the degree of constriction in the uvular region of the upper pharynx to vary the position of the focused formant, thereby generating a basis for melodic structure.

Discussion

This study has shown that Tuvan singers performing Sygyt-style Khoomei exercise precise control of the vocal tract to effectively merge multiple formants together. They morph their vocal tracts so as to create a sustained focused state that effectively filters an underlying stable array of overtones. This focused filter greatly accentuates energy of a small subset of higher order overtones primarily in the octave-band spanning 1–2 kHz, as quantified by an energy ratio eR(1,2). Some singers are even capable of producing additional foci at higher frequencies. Below, we argue that a linear framework (i.e., source/filter model, Stevens, 2000) appears sufficient to capture this behavior including the sudden transitions into a focused state, demonstrating that nonlinearities are not a priori essential. That is, since the filter characteristics are highly sensitive to vocal tract geometry, precise biomechanical motor control of the singers is sufficient to achieve a focused state without invoking nonlinearities or a second source as found in other vocalization types (e.g., Herzel and Reuter, 1996; Fee et al., 1998). Lastly, we describe several considerations associated with how focused overtone song produces such a salient percept by virtue of a pitch decoherence.

Source or filter?

The notion of a focused state is mostly consistent with vocal tract filter-based explanations for biphonation in previous studies (e.g., Bloothooft et al., 1992; Edgerton et al., 1999; Adachi and Yamada, 1999; Grawunder, 2009), where terms such as an ‘interaction of closely spaced formants’, ‘reinforced harmonics’, and ‘formant melting’ were used. In addition, the merging of multiple formants is closely related to the ‘singer’s formant’, which is proposed to arise around 3 kHz due to formants F3–F5 combining (Story, 2016), though this is typically broader and less prominent than the focused states exhibited by the Tuvans. Our results explain how this occurs and are also broadly consistent with Adachi and Yamada (1999) in that a constricted ‘rear cavity’ is crucial. However, we find that this rear constriction determines the pitch of the focused formant, whereas it is the ‘front cavity’ constriction near the alveolar ridge that produces the focusing effect (i.e., merging of formants F2 and F3).

Further, the present data appear in several ways inconsistent with conclusions from previous studies of Khoomei, especially those that center on effects that arise from changes in the source. Three salient examples are highlighted. First, we observed overtone structure to be highly stable, though some vibrato may be present. This contrasts with the claim by Levin and Edgerton (1999) that “(t)o tune a harmonic, the vocalist adjusts the fundamental frequency of the buzzing sound produced by the vocal folds, so as to bring the harmonic into alignment with a formant”. That is, we see no evidence for the overtone ‘ladder’ being lowered or lifted as they suggested (note in Figure 1, f0 is held nearly constant). Further, this stability argues against a transition into a different mode of glottal pulse generation, which could allow for a ‘second source’ (Mergell and Herzel, 1997). Second, a single sharply defined harmonic alone is not sufficient to get the salient perception of a focused state, as had been suggested by Levin and Edgerton (1999). Consider Appendix 1—figure 9, especially at the 4 s mark, where the voicing is ‘pressed’. Pressed phonation, also referred to as ventricular voice, occurs when glottal flow is affected by virtue of tightening the laryngeal muscles such that the ventricular folds are brought into vibration. This has the perceptual effect of adding a degree of roughness to the voice sound (Lindestad et al., 2001; Edmondson and Esling, 2006). There, a harmonic at 1.51 kHz dominates (i.e., the two flanking overtones are approximately 40 dB down), yet the song has not yet perceptibly transitioned. It is not until the cluster of overtones at 3–3.5 kHz is brought into focus that the perceptual effect becomes salient, perhaps because prior to the 5.3 s mark the broadband nature of those frequencies effectively masks the first focused state. Third, we do not observe subharmonics, which contrasts with a prior claim (Lindestad et al., 2001) that “(t)his combined voice source produces a very dense spectrum of overtones suitable for overtone enhancement”. However, that study was focused on a different style of song called ‘Kargyraa’, which does not exhibit a focused state as clearly as Sygyt does.

Linear versus nonlinear mechanisms

An underlying biophysical question is whether focused overtone song arises from inherently linear or nonlinear processes. Given that Khoomei consists of the voicing of two or more pitches at once and exhibits dramatic and fast transitions from normal singing to biphonation, nonlinear phenomena may seem like an obvious candidate (Herzel and Reuter, 1996). It should be noted that Herzel and Reuter (1996) go so far as to define biphonation explicitly through the lens of nonlinearity. We relax such a definition and argue for a perceptual basis for delineating the boundaries of biphonation. Certain frog species exhibit biphonation, and it has been suggested that their vocalizations can arise from complex nonlinear oscillatory regimes of separate elastically coupled masses (Suthers et al., 2006). Further, the appearance of abrupt changes in physiological systems (as seen in Figure 1) has been argued to be a flag for nonlinear mechanisms (Goldberger et al., 2002); for example, by virtue of progression through a bifurcation.

Our results present two lines of evidence that argue against Sygyt-style Khoomei arising primarily from a nonlinear process. First, the underlying harmonic structure of the vocal fold source appears highly stable through the transition into the focused state (Figure 1). There is little evidence of subharmonics. A source spectral structure that is comprised of an f0 and integral harmonics would suggest a primarily linear source mechanism. Second is that our modeling efforts, which are chiefly linear in nature, reasonably account for the sudden and salient transition. That is, the model is readily sufficient to capture the characteristic that small changes in the vocal tract can produce large changes in the filter. Thereby, precise and fast motor control of the articulators in a linear framework accounts for the transitions into and out of the focused state. Thus, in essence, Sygyt-style Khoomei could be considered a linear means to achieve biphonation. Connecting back to nonlinear phonation mechanisms in non-mammals, our results provide further context for how human song production and perception may be similar and/or different relative to that of non-humans (e.g., Doolittle et al., 2014; Kingsley et al., 2018).

Nevertheless, features that appear transiently in spectrograms do provide hints of source nonlinearity, such as the brief appearance of subharmonics in some instances (Appendix 1—figure 15B). This provides an opportunity to address the limitations of the current modeling efforts and to highlight future considerations. We suggest that further analysis (e.g., Theiler et al., 1992; Tokuda et al., 2002; Kantz and Schreiber, 2004) of Khoomei audio recordings may help to inform the model and might better capture focused filter sharpness and the origin of secondary focused states. Several potential areas for improvement are: nonlinear source–filter coupling (Titze et al., 2008); a detailed model of glottal dynamics (e.g., ratio of open/closed phases in glottal flow [Grawunder, 2009; Li and Hou, 2017], and periodic vibrations in f0); inclusion of piriform sinuses as side-branch resonators (Dang and Honda, 1997; Titze and Story, 1997); inclusion of the 3-D geometry; and detailed study of the front cavity (e.g., lip movements) that may be used by the singer to maintain control of the focused state and to make subtle manipulations.

Perceptual consequences of overtone focusing

Although this study did not directly assess the percept associated with these vocal productions, the results raise pressing questions about how the spectro-temporal signatures of biphonic Khoomei described here create the classical perception of Sygyt-style Khoomei as two distinct sounds (Aksenov, 1973). The first, the low-pitched drone, which is present during both the normal singing and the focused-state biphonation intervals, reflects the pitch associated with f0, extracted from the harmonic representation of the stimulus. It is well established that the perceived pitch of a broadband sound comprised of harmonics reflects the f0 derived primarily from the perceptually resolved harmonics up to about 10f0 (Bernstein and Oxenham, 2003). The frequency resolution of the peripheral auditory system is such that these low-order harmonics are individually resolved by the cochlea, and it appears that such filtering is an important prerequisite for pitch extraction associated with that common f0. The second sound, the high-pitched melody, is present only during the focused-state intervals and probably reflects a pitch associated with the focused formant. An open question, however, is why this focused formant would be perceived incoherently as a separate pitch (Shamma et al., 2011), when it contains harmonics at multiples of f0. The auditory system tends to group together concurrent harmonics into a single perceived object with a common pitch (Roberts et al., 2015), and the multiple formants of a sung or unsung voice are not generally perceived as separate sounds from the low harmonics.

The fact that the focused formant is so narrow apparently leads the auditory system to interpret this sound as if it were a separate tone, independent of the low harmonics associated with the drone percept, thereby effectively leading to a pitch decoherence. This perceptual separation could be attributable to a combination of both bottom-up (i.e., cochlear) and top-down (i.e., perceptual) factors. From the bottom-up standpoint, even if the focused formant is broad enough to encompass several harmonic components, the fact that it consists of harmonics at or above 10 f0 (i.e., the 1500 Hz formant frequency represents the 10th harmonic of a 150 Hz f0) means that these harmonics will not be spectrally resolved by cochlear filtering (Bernstein and Oxenham, 2003). Instead, the formant will be represented as a single spectral peak, similar to the representation of a single pure tone at the formant frequency. Although the interaction of harmonic components at this cochlear location will generate amplitude modulation at a rate equal to the f0 (Plack and Oxenham, 2005), it has been argued that a common f0 is a weak cue for binding low- and high-frequency formants (Culling and Darwin, 1993). Rather, other top-down mechanisms of auditory-object formation may play a more important role in generating a perception of two separate objects in Khoomei. For example, the rapid onsets of the focused formant may enhance its perceptual separation from the constant drone (Darwin, 1984). Further, the fact that the focused formant has a variable frequency (i.e., frequency modulation, or FM) while the drone maintains a constant f0 is another difference that could facilitate their perceptual separation. Although it has been argued that FM differences between harmonic sounds generally have little influence on their perceived separation (Darwin, 2005), others have reported enhanced separation in the special case in which one complex was static and the other had applied FM (Summerfield and Culling, 1992) – similar to the first and second formants during the Tuvan focused state.

The perceptual separation of the two sounds in the Tuvan song might be further affected by a priori expectations about the spectral qualities of vocal formants (Billig et al., 2013). Because a narrow formant occurs so rarely in natural singing and speech, the auditory system might be pre-disposed against perceiving it as a phonetic element, limiting its perceptual integration with the other existing formants. Research into ‘sine-wave speech’ provides some insights into this phenomenon. When three or four individual frequency-modulated sinusoids are presented at formant frequencies in lieu of natural formants, listeners can, with sufficient training, perceive the combination as speech (Remez et al., 1981). Nevertheless, listeners largely perceive these unnatural individual pure tones as separate auditory objects (Remez et al., 2001), much like the focused formant in Khoomei. Further research exploring these considerations would help close the production–perception circle underlying the unique percept arising from Tuvan throat song.

Materials and methods

Acoustical recordings

Recordings were made at York University (Toronto, ON, Canada) in a double-walled acoustic isolation booth (IAC) using a Zoom H5 24-bit digital recorder and an Audio-Technica P48 condenser microphone. A sample rate of 96 kHz was used. Spectral analysis was done using custom-coded software in Matlab. Spectrograms were typically computed using 4096 point window segments with 95% fractional overlap and a Hamming window. Harmonics (black circles in Figure 1) were estimated using a custom-coded peak-picking algorithm. Estimated formant trends (red dashed lines in Figure 1) were determined using a lowpass interpolating filter built into Matlab’s digital signal processing toolbox with a scaling factor of 10. From this trend, the peak-picking was reapplied to determine ‘formant’ frequencies (red ‘x’s in Figure 1). This process could be repeated across the spectrogram to track overtone and formant frequency/strength effectively, as shown in Appendix 1—figure 1.
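
The spectrogram settings quoted above can be turned into concrete numbers. The following is a minimal sketch in Python (standard library only; it is not the authors' Matlab code) that derives the window duration, hop size, and frequency-bin spacing implied by a 4096-point Hamming window at 96 kHz with 95% overlap. Rounding the fractional hop to a whole sample is an assumption, since the text does not say how it was handled, and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 96_000      # from the text
WINDOW_POINTS = 4096         # from the text
FRACTIONAL_OVERLAP = 0.95    # from the text


def hamming(n_points):
    """Symmetric Hamming window: w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def spectrogram_layout(fs_hz, n_win, overlap):
    """Window duration (s), hop (samples and s), and bin spacing (Hz)."""
    # Assumption: the fractional hop is rounded to the nearest whole sample.
    hop = max(1, round(n_win * (1.0 - overlap)))
    return {
        "window_s": n_win / fs_hz,
        "hop_samples": hop,
        "hop_s": hop / fs_hz,
        "bin_hz": fs_hz / n_win,
    }


layout = spectrogram_layout(SAMPLE_RATE_HZ, WINDOW_POINTS, FRACTIONAL_OVERLAP)
window = hamming(WINDOW_POINTS)
for name, value in layout.items():
    print(name, value)
```

Under these settings the bin spacing is 96000/4096 Hz, so overtones of a drone near 150 Hz would fall roughly 6.4 bins apart, which is consistent with the individual harmonics being separable in the spectrogram.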

To quantify the focused states, we developed a dimensionless measure eR(fL,fH) to represent the ratio of the energy within a frequency range [fL,fH] to that of the entire spectral output. This can be readily computed from the spectrogram data as follows. First take a ‘slice’ from the spectrogram and convert spectral magnitude to linear ordinate and square it (as intensity is proportional to pressure squared). Then integrate across frequency, first for a limited range spanning [fL,fH] (e.g., 1–2 kHz) and then for a broader range of [0,fmax] (e.g., 0–8 kHz; 8 kHz is a suitable maximum as there is little acoustic energy in vocal output above this frequency). The ratio of these two is then defined as eR, and takes on values between 0 and 1. This can be expressed more explicitly as:

$$e_R(f_L,f_H)=\left(\frac{\int_{f_L}^{f_H}P(f)\,df}{\int_{0}^{f_{max}}P(f)\,df}\right)^{2} \qquad (1)$$

where P is the magnitude of the scaled sound pressure, f is frequency, and fL and fH are filter limits for considering the focused state. The choice of [fL,fH]=[1,2] kHz has the virtue of spanning an octave, which also closely approximates the ‘seventh octave’ from about C6 to C7. eR did not depend significantly upon the length of the fast Fourier transform (FFT) window. Values of eR for the waveforms used in Figure 1 are shown in Appendix 1—figures 2 and 3.
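
The energy-ratio procedure can be sketched in a few lines. The Python example below (standard library only, with hypothetical function names) follows the prose description: convert to linear magnitude, square, integrate over the band and over the full range, and take the ratio. Note that Equation (1) as printed places the square outside the ratio of integrals; which form the authors' Matlab code used is not stated here, so this sketch is an illustration under the prose reading, with trapezoidal integration as a further assumption.

```python
def db_to_linear(level_db):
    """Convert a magnitude in dB to a linear pressure magnitude."""
    return 10.0 ** (level_db / 20.0)


def trapezoid(xs, ys):
    """Trapezoidal integral of ys over ascending xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def energy_ratio(freqs_hz, magnitude, f_lo=1000.0, f_hi=2000.0, f_max=8000.0):
    """Fraction of spectral energy in [f_lo, f_hi] relative to [0, f_max].

    freqs_hz  : ascending frequency axis of one spectrogram slice (Hz)
    magnitude : linear (not dB) pressure magnitude at each frequency
    """
    # Intensity is proportional to pressure squared (per the prose).
    intensity = [m * m for m in magnitude]

    def band_energy(lo, hi):
        pts = [(f, p) for f, p in zip(freqs_hz, intensity) if lo <= f <= hi]
        if len(pts) < 2:
            return 0.0
        fs, ps = zip(*pts)
        return trapezoid(fs, ps)

    total = band_energy(0.0, f_max)
    return band_energy(f_lo, f_hi) / total if total > 0.0 else 0.0


# Synthetic slices (made-up values, not data from the recordings).
freqs = [100.0 * i for i in range(81)]                 # 0 to 8 kHz
flat = [1.0] * len(freqs)                              # unfocused spectrum
focused = [10.0 if f == 1500.0 else 0.1 for f in freqs]

er_flat = energy_ratio(freqs, flat)
er_focused = energy_ratio(freqs, focused)
print(er_flat, er_focused)
```

A flat spectrum puts one eighth of its energy in the 1–2 kHz band, whereas a spectrum dominated by a single strong component near 1.5 kHz yields a ratio close to 1, which is the contrast the measure is designed to expose.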

MRI acquisition and volumetric analysis

MRI images were acquired at the York MRI Facility on a 3.0 Tesla MRI scanner (Siemens Magnetom TIM Trio, Erlangen, Germany), using a 12-channel head coil and a neck array. Data were collected with the approval of the York University Institutional Review Board. The participant was fitted with an MRI compatible noise-cancelling microphone (Optoacoustics, Mazor, Israel) mounted directly above the lips. The latency of the microphone and noise-cancelling algorithm was 24 ms. Auditory recordings were made in QuickTime on an iMac during the scans to verify performance.

Images were acquired using one of two paradigms, static or dynamic. Static images were acquired using a T1-weighted 3D gradient echo sequence in the sagittal orientation with 44 slices centered on the vocal tract, TR = 2.35 ms, TE = 0.97 ms, flip angle = 8 degrees, FoV = 300 mm, and a voxel dimension of 1.2 × 1.2 × 1.2 mm. Total acquisition time was 11 s. The participant was instructed to begin singing a tone, and to hold it in a steady state for the duration of the scan. The scan was started immediately after the participant began to sing and had reached a steady state. Audio recordings verified a consistent tone for the duration of the scan. Dynamic images were acquired using a 2D gradient echo sequence. A single 10.0 mm thick slice was positioned in a sagittal orientation along the midline of the vocal tract, TR = 4.6 ms, TE = 2.04 ms, flip angle = 8 degrees, FoV = 250 mm, and a voxel dimension of 2.0 × 2.0 × 10.0 mm. One hundred measurements were taken for a scan duration of 27.75 s. The effective frame rate of the dynamic images was 3.6 Hz. Audio recordings were started just prior to scanning. Only subject T2 participated in the MRI recordings. The participant was instructed to sing a melody for the duration of the scan, and took breaths as needed.
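
As a quick consistency check on the dynamic-scan timing quoted above (a worked calculation from the stated numbers only, not additional data):

```latex
\[
\Delta t = \frac{27.75\ \text{s}}{100\ \text{measurements}} = 0.2775\ \text{s per frame},
\qquad
\frac{1}{\Delta t} \approx 3.6\ \text{Hz}.
\]
```

At the fundamental of roughly 150 Hz reported in the modeling section, each frame would span about 42 glottal cycles, so the dynamic images presumably capture slow articulatory changes rather than individual vocal fold cycles.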

For segmentation (Figure 3), 3D MRI images (Run1; see Appendix) were loaded into Slicer (version 4.6.2 r25516). The air-space in the oral cavity was manually segmented using the segmentation module, identified and painted in slice by slice. Careful attention was paid to the parts of the oral cavity that were affected by the artifact from the dental implant. The air cavity was manually repainted to be approximately symmetric in this region using the coronal and axial view (Figure 3A). Once completely segmented, the sections were converted into a 3D model and exported as a STL file. This mesh file was imported into MeshLab (v1.3.4Beta) for cleaning and repairing the mesh. The surface of the STL was converted to be continuous by removing non-manifold faces and then smoothed using depth and Laplacian filters. The mesh was then imported into Meshmixer where further artifacts were removed. This surface-smoothed STL file was finally reimported into Slicer, generating the display in Figure 3B.

Computational modeling

Measurement of the cross-distance function is illustrated in Figure 4. The inner and outer profiles of the vocal tract were first determined by manual tracing of the midsagittal image. A 2D iterative bisection algorithm (Story, 2007) was then used to find the centerline within the profiles extending from the glottis to the lips, as shown by the white dots in Figure 4A. Perpendicular to each point on the centerline, the distance from the inner to outer profiles was measured to generate the cross-distance function shown in Figure 4B; the corresponding locations of the anatomic landmarks shown in the midsagittal image are also indicated on the cross-distance function.

The cross-distance function, D(x), can be transformed to an approximate area function, A(x), with the relation $A(x)=kD^{\alpha}(x)$, where k and α are a scaling factor and exponent, respectively. If the elements of D(x) are considered to be diameters of a circular cross-section, k=(π/4) and α=2. Although other values of k and α have been proposed to account for the complex shape of the vocal tract cross-section (Heinz and Stevens, 1964; Lindblom and Sundberg, 1971; Mermelstein, 1973), there is no agreement on a fixed set of numbers for each parameter. Hence, the circular approximation was used in this study to generate an estimate of the area function. In Figure 4C, the area function is plotted as its tubular equivalent, where the radii D(x)/2 were rotated about an axis to generate circular sections from the glottis to the lips.
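
The cross-distance to area conversion is simple enough to show directly. The sketch below (Python, standard library only) applies the power-law relation with the circular defaults k = π/4 and α = 2, then recovers the equivalent radii used for a tubular plot. The function names and the sample cross-distances are hypothetical and chosen only for illustration.

```python
import math


def area_function(cross_distances_cm, k=math.pi / 4.0, alpha=2.0):
    """A(x) = k * D(x)**alpha; with the defaults, D is treated as a diameter."""
    return [k * d ** alpha for d in cross_distances_cm]


def equivalent_radii(areas_cm2):
    """Radius of the circle having each cross-sectional area."""
    return [math.sqrt(a / math.pi) for a in areas_cm2]


# Hypothetical cross-distances in cm, glottis to lips (not measured values).
d_cm = [1.0, 2.0, 0.5, 1.6]
areas = area_function(d_cm)
radii = equivalent_radii(areas)
print(areas)
print(radii)
```

With the circular assumption the equivalent radius reduces to D(x)/2, as stated in the text; other published choices of k and α would be passed in as arguments rather than hard-coded.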

The associated frequency response of that area function is shown in Figure 4D and was calculated with a transmission line approach (Sondhi and Schroeter, 1987; Story et al., 2000), which included energy losses due to yielding walls, viscosity, heat conduction, and acoustic radiation at the lips. Side branches such as the piriform sinuses were not considered in detail in this study. The first five formant frequencies (resonances), F1,…,F5, were determined by finding the peaks in the frequency response functions with a peak-picking algorithm (Titze et al., 1987) and are located at 400, 1065, 1314, 3286, and 4029 Hz, respectively.
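
To illustrate how formants follow from an area function, here is a deliberately simplified sketch in Python (standard library only). It is not the transmission-line model cited above: it assumes lossless, equal-length tube sections, an ideal open end at the lips, and assumed values for the speed of sound and air density. It also locates resonances as zero crossings of the chain-matrix D element rather than with the cited peak-picking algorithm. All function names are hypothetical.

```python
import math

SPEED_OF_SOUND_CM_S = 35_000.0   # assumed value for warm, humid air
AIR_DENSITY_G_CM3 = 0.00114      # assumed value


def chain_d_element(areas_cm2, section_len_cm, freq_hz):
    """D element of the glottis-to-lips chain (ABCD) matrix, lossless tubes.

    For lossless sections A and D are real while B and C are purely
    imaginary, so b and c below hold the imaginary parts only.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND_CM_S
    rho_c = AIR_DENSITY_G_CM3 * SPEED_OF_SOUND_CM_S
    cos_kl = math.cos(k * section_len_cm)
    sin_kl = math.sin(k * section_len_cm)
    a, b, c, d = 1.0, 0.0, 0.0, 1.0          # identity matrix
    for area in areas_cm2:
        z = rho_c / area                     # characteristic impedance
        # Section matrix [[cos, j*z*sin], [j*sin/z, cos]]
        sa, sb, sc, sd = cos_kl, z * sin_kl, sin_kl / z, cos_kl
        # Right-multiply the running product (j * j = -1 gives the minus signs).
        a, b, c, d = (a * sa - b * sc, a * sb + b * sd,
                      c * sa + d * sc, d * sd - c * sb)
    return d


def formants(areas_cm2, section_len_cm, f_max_hz=5000.0, step_hz=5.0):
    """Resonances where D crosses zero (lip pressure assumed zero)."""
    found = []
    f_prev = step_hz
    d_prev = chain_d_element(areas_cm2, section_len_cm, f_prev)
    for i in range(2, int(f_max_hz / step_hz) + 1):
        f_now = i * step_hz
        d_now = chain_d_element(areas_cm2, section_len_cm, f_now)
        if d_prev != 0.0 and d_prev * d_now <= 0.0:
            # Linear interpolation between the two grid points.
            found.append(f_prev + step_hz * d_prev / (d_prev - d_now))
        f_prev, d_prev = f_now, d_now
    return found


# Uniform 17.5 cm tube (35 sections of 0.5 cm, 3 cm^2), then the same tube
# with the last 2 cm narrowed to 0.3 cm^2. Both shapes are hypothetical.
uniform = [3.0] * 35
narrowed = [3.0] * 31 + [0.3] * 4
f_uniform = formants(uniform, 0.5)
f_narrowed = formants(narrowed, 0.5)
print(f_uniform)
print(f_narrowed)
```

For the uniform tube the resonances fall at odd multiples of c/4L, the textbook quarter-wavelength result, which gives a check on the matrix bookkeeping. Narrowing even a short region moves the resonances substantially, which is the qualitative sensitivity to geometry that the modeling in this study relies on.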

To examine changes in pitch, a particular vocal tract configuration was manually ‘designed’ (Appendix 1—figure 6) such that it included constrictive and expansive regions at locations similar to those measured from the singer (i.e., Figure 4), but to a less extreme degree. We henceforth denote this area function as A0(x), and it generates a frequency response with widely spaced formant frequencies (F1…F5 = [529, 1544, 2438, 3094, 4236] Hz), essentially a neutral vowel. In many of the audio signals recorded from the singer, the fundamental frequency, fo (i.e., the vibratory frequency of the vocal folds), was typically about 150 Hz. The singer then appeared to enhance one of the harmonics in the approximate range of 8fo…12fo. Taking the 12th harmonic (12×150=1800 Hz) as an example target frequency (dashed line in the frequency response shown in Figure 5c), the area function A0(x) was iteratively perturbed by the acoustic-sensitivity algorithm described in Story (2006) until F2 and F3 converged on 1800 Hz and became a single formant peak in the frequency response. Additional details on the perturbation process leading into Figure 5 are given in the Appendix.

Appendix 1

This appendix contains supporting information for the document Overtone focusing in biphonic Tuvan throat singing by Bergevin et al. Citations here refer to the bibliography of the main document. First (Methodological considerations), we include several methodological components associated with the quantitative analysis of the waveforms, helping illustrate different approaches towards characterizing the acoustic data and rationale underlying control measures. Second (Additional waveform analyses), we include additional plots to support results and discussion in the main text. For example, different spectrograms are presented, as are analyses for additional waveforms. This section also helps to provide additional context for a second independent focused state. The third section (Additional modeling analysis figures) details theoretical components leading into the results of the computational model and how the MRI data constrain the key parameters, justifying arguments surrounding the notion of formant merging. Fourth (Instability in focused state), some speculative discussion and basic modeling aspects are presented with regard to the notion of instabilities present in the motor control of the focused state. In the fifth section (Additional MRI analysis figures), images stemming from the MRI data are presented. Last, the final three sections detail accessing the acoustic waveforms, MRI data files, and waveform analysis (Matlab-based) software via an online repository.

Methodological considerations

Overtone and formant tracking

To facilitate quantification of the waveforms, we custom-coded a peak-picking/tracking algorithm to analyze the time-frequency representations produced by the spectrogram. Appendix 1—figure 1 shows an example of the tracking of the overtones (red dots) and formants (grayscale dots; intensity coded by relative magnitude as indicated by the colorbar). This representation provides an alternative view (compared to Figure 1) to help demonstrate that, by and large, the overtone structure is highly consistent throughout, while the formant structure varies significantly across the transition.

Appendix 1—figure 1

Same as Figure 1 (middle left panel; subject T2, same sound file as shown in the middle panel of Figure 1), except with overtones and estimated formant structure tracked across time.

Quantifying focused states

Appendix 1—figures 2 and 3 show calculation of the energy ratio eR used as a means to quantify the degree of focus. For Appendix 1—figure 2, the waveforms are the same as those shown in Figure 1 (though with slightly different axis limits). In general, we found that eR(1,2) provided a clear means to distinguish the focused state, as values were close to zero prior to the transition and larger/sustained beyond the transition. Singer T2 was an exception. Appendix 1—figure 3 is for singer T2, using the same file (i.e., the transition point into the focused state between 6 and 7 s in this figure is the same as that shown in the middle panel of Figure 1), but with an expanded timescale to illustrate the larger eR values prior to the transition. This is due to the relatively large amount of energy present between 2.5–4 kHz. We also explored eR(1,2) values in a wide range of phonetic signals, such as child and adult vocalizations, other singing styles (e.g., opera), non-Tuvan singers (e.g., Westerners) performing ‘overtone singing’, and older recordings of Tuvan singers. In general, it was observed that eR(1,2) was relatively large and sustained across time for focused overtone song, whereas the value was close to zero and/or highly transient for other vocalizations.

Appendix 1—figure 2

Same data/layout as in Figure 1 but now showing eR(1,2) as defined in the ‘Materials and methods’.

These plots show the energy ratio focused between 1–2 kHz. Vertical red dashed lines indicate approximate time of transition into the focused state. An expanded timescale is also shown for singer T2 …

Appendix 1—figure 3

Similar to Figure 2 for singer T2 (middle panel), except an expanded time scale is shown to demonstrate the earlier dynamics as this singer approaches the focused state (see T2_5longer.wav).

Control measurements

The waveforms from the Tuvan singers provide an intrinsic degree of control (i.e., voicing not in the focused state). Similar to Figure 1, Appendix 1—figure 4 shows the spectra prior to the transition into the focused state. Although relatively narrow harmonics can be observed, they tend to occur below 1 kHz. This is consistent with our calculations of eR(1,2): prior to a transition into a focused state, this value is close to zero. The exception is singer T2, who instead shows a relatively large amount of energy about 1.8–3 kHz that may have some sort of masking effect (see ‘Discussion’ in the main text, and the ‘Pressed transition’ section below). In addition, Tuvan singer T4, who used a non-Sygyt style (Appendix 1—figure 5), can also effectively be considered a ‘control’.

Appendix 1—figure 4

Stemming directly from Figure 1, the right-hand column now shows a spectrum from a time point prior to transition into the focused state (as denoted by the vertical black lines in the left column). The shape of the spectra from Figure 1 is also included for reference.

Appendix 1—figure 5

Spectrogram for singer T4 singing in non-Sygyt style (first song segment of T2_4shortA.wav sound file). For the spectrogram, 4096 point windows were used for the fast Fourier transform (FFT) with 95% fractional overlap and a Hamming window.

Additional waveform analyses

Other spectrograms

Appendix 1—figure 5 shows a spectrogram from singer T4 (T4_shortA.wav) singing in non-Sygyt style. While producing a distinctive sound, note the relative lack of energy above approximately 1 kHz. Appendix 1—figure 6 shows a spectrogram from singer T2 (T2_5.wav) over a longer timescale than that shown in Figure 1. Similarly for Appendix 1—figure 7, but for singer T1. Both of these plots provide a spectral-temporal view of how the singer maintains and modulates the song over the course of a single exhalation. Note both the sudden transitions into different maintained pitches and the briefer transient excursions.

Appendix 1—figure 6

Spectrogram of the entire T2_5.wav sound file. The sample rate was 96 kHz. The analysis parameters used were the same as those used for Figure 5.

Appendix 1—figure 7

Spectrogram of the first song segment of the T1_3.wav sound file. The analysis parameters used were the same as those for Figure 5.

Second independent focused state

Appendix 1—figure 8 shows another example of a transition in Sygyt-style song for singer T2, clearly showing a second focused state about 3–3.5 kHz. Two aspects merit highlighting. First, the spectral peaks are not harmonically related: at t=4.5 s, the first focused state is at 1.36 kHz and the other at 3.17 kHz (far from the 2.72 kHz that would be expected if it were twice the first). Second, during the singer-induced pitch change at 3.85 s, the two peaks do not move in unison. Although not ruling out correlations between the two focused states, these observations suggest that they are not simply nor strongly related to one another.

Appendix 1—figure 8

Singer T2’s transition into a focused state. Note that while the first focused state transitions from approximately 1.36 to 1.78 kHz, the second state remains nearly constant, decreasing only slightly from 3.32 to 3.17 kHz (T2_1shortB.wav).
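
The argument that the two focused states are not harmonically related can be checked with simple interval arithmetic. The sketch below (Python, standard library only) uses only the frequencies quoted in the text and caption above; the helper name `cents` and the choice to express mismatches in cents are illustrative assumptions, not part of the original analysis.

```python
import math


def cents(f_a_hz, f_b_hz):
    """Interval from f_b to f_a in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_a_hz / f_b_hz)


first_focus_hz = 1360.0    # first focused state at t = 4.5 s (from the text)
second_focus_hz = 3170.0   # second focused state at t = 4.5 s (from the text)

ratio = second_focus_hz / first_focus_hz
nearest_multiple = round(ratio)
expected_hz = nearest_multiple * first_focus_hz
mismatch_cents = cents(second_focus_hz, expected_hz)

# Movement of each focus across the pitch change (values from the caption).
first_shift_cents = cents(1780.0, 1360.0)
second_shift_cents = cents(3170.0, 3320.0)

print(ratio, expected_hz, mismatch_cents)
print(first_shift_cents, second_shift_cents)
```

The upper focus sits more than a whole tone away from the nearest integer multiple of the lower one, and the two foci shift by very different amounts and in opposite directions, which is consistent with the claim that they are not simply related.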

Pressed transition

Appendix 1—figure 9 shows a spectrogram and several spectral slices for the sound file in which the voicing was ‘pressed’ (Adachi and Yamada, 1999; Edmondson and Esling, 2006) prior to the transition into the focused state. That is, prior to the 1.8 s mark, voicing is relatively normal. But after that point (prior to the transition into the focused state around 5.4 s), substantial energy appears between 2–4 kHz along with a degree of vibrato. Note, however, that there is no change to the overall overtone structure (e.g., no emergence of subharmonics). The spectrum at t=4.0 s, prior to the transition, provides a useful comparison back to Levin and Süzükei (2006). Specifically, one particular overtone is singled out and highly focused, yet the broadband cluster of overtones about 2.5–4 kHz effectively masks it. It is not until about the 5.4 s mark, when those higher overtones are also brought into focus, that a salient perception of the Sygyt-style emerges.

Appendix 1—figure 9

Spectrogram of singer T2 exhibiting pressed voicing heading into transition to focused state (T2_2short.wav).

Additional modeling analysis figures

The measurement of the cross-distance function (as described in the ‘Materials and methods’), along with calculation of the frequency response from an estimate of the area function, suggested that constrictions of the vocal tract in the region of the uvula and alveolar ridge may play a significant role in controlling the spectral focus generated by the convergence of F2 and F3. Assuming that an overtone singer configures the vocal tract to merge these two formants deliberately such that, together, they enhance the amplitude of a selected harmonic of the voice source, the aim was to investigate how the vocal tract can be systematically shaped with precisely placed constrictions and expansions to both merge F2 and F3 into a focused cluster and move the cluster along the frequency axis to allow for selection of a range of voice harmonics.

Appendix 1—figure 11b shows the same area function as that in Appendix 1—figure 11a (see ‘Materials and methods’) but plotted by extending the equivalent radius of each cross-sectional area, outward and inward, along a line perpendicular to the centerline measured from the singer (see Figure 4A), resulting in an inner and outer outline of the vocal tract shape as indicated by the thick black lines. The measured centerline is also shown in the figure, along with anatomic landmarks. As this does not represent a true midsagittal plane, it will be referred to here as a pseudo-midsagittal plot (Story et al., 2001).

Appendix 1—figure 10

Overview of source/filter theory, as advanced by Stevens (2000). The left column shows normal phonation, whereas the right indicates one example of a focused state.

Appendix 1—figure 11

Setup of the baseline vocal tract configuration used in the modeling study.

(a) The area function (A0(x)) is in the lower panel and its frequency response is in the upper panel. (b) The area function from (a) is shown as a pseudo-midsagittal plot (see text).

Appendix 1—figure 12a shows the new area function and frequency response generated by the perturbation process, whereas the pseudo-midsagittal plot is shown in Appendix 1—figure 12b. Relative to the shape of A0(x) (shown as the thin gray line), the primary modification is a severe constriction imposed between 12.5–13.5 cm from the glottis, essentially at the alveolar ridge. Although the line thickness might suggest that the vocal tract is occluded in this region, the minimum cross-sectional area is 0.09 cm². There is also a more moderate constriction at about 5 cm from the glottis, and a slight expansion between 7–10.5 cm from the glottis. The frequency response in the upper panel of Appendix 1—figure 12a demonstrates that the new area function was successful in driving F2 and F3 together to form a single formant peak centered at 1800 Hz, which is at least 15 dB higher in amplitude than any of the other formants. Exactly the same process was used to generate area functions for which F2 and F3 converge on the target harmonic frequencies 8fo, 9fo, 10fo, and 11fo (1200, 1350, 1500, and 1650 Hz, respectively). The results, along with those from the previous figure for 12fo, are shown in Appendix 1—figure 13. The collection of frequency responses in the upper panel of Appendix 1—figure 13b shows that F2 and F3 successfully converged to become one formant peak in each of the cases, and their locations on the frequency axis are centered around the specified target frequencies. The corresponding area functions in the lower panel suggest that the constriction between 12.5–13.5 cm from the glottis (alveolar ridge region) is present in roughly the same form for all five cases. By contrast, an increasingly severe constriction must be imposed in the region between 6–8.5 cm from the glottis (uvular region) in order to shift the target frequency (i.e., the frequency at which F2 and F3 converge) downward through progression of specified harmonics. Coincident with this constriction is a progressively larger expansion between 14–15.5 cm from the glottis, which probably assists in positioning the focal regions of F2 and F3 downward. It can also be noted that the area function that generates a focus at 8fo (1200 Hz; thinnest line) is most similar to the one generated from the cross-distance measurements (i.e., Figure 4c). In both, there are constrictions located at about 7.5 cm and 13 cm from the glottis; the expansions in the lower pharynx and oral cavity are also quite similar. The main difference is the greater expansion of the region between 8–13 cm from the glottis in the acoustically derived area function.
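
The target frequencies used in the perturbation runs follow directly from the harmonic numbers. As a small worked example (Python, standard library only; the function names are hypothetical), the sketch below lists the targets for fo = 150 Hz, confirms that they lie inside the 1–2 kHz band used for eR(1,2), and reports the musical step between adjacent targets in cents.

```python
import math

FO_HZ = 150.0   # typical fundamental frequency reported in the text


def harmonic_targets(fo_hz, first=8, last=12):
    """Frequencies of harmonics first..last of the fundamental fo_hz."""
    return [n * fo_hz for n in range(first, last + 1)]


def in_band(freq_hz, f_lo=1000.0, f_hi=2000.0):
    """True if freq_hz lies inside the octave band used for eR(1,2)."""
    return f_lo <= freq_hz <= f_hi


def cents(f_a_hz, f_b_hz):
    """Interval from f_b to f_a in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_a_hz / f_b_hz)


targets = harmonic_targets(FO_HZ)
all_in_band = all(in_band(f) for f in targets)
step_cents = [cents(hi, lo) for lo, hi in zip(targets, targets[1:])]

print(targets)
print(all_in_band)
print(step_cents)
```

Because the harmonics are equally spaced in hertz, the melodic step between neighbors shrinks as the harmonic number rises, so the available melody notes form a fixed, unevenly spaced scale determined by the drone.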

On the basis of the results, a mechanism for controlling the enhancement of voice harmonics can be proposed: the degree of constriction near the alveolar ridge in the oral cavity (labeled Co in Figure 5 of the main text) controls the proximity of F2 and F3 to each other, whereas the degree of constriction near the uvula in the upper pharynx, Cp, controls the frequency at which F2 and F3 converge (the expansion anterior to Co may also contribute). Thus, an overtone singer could potentially ‘play’ (i.e., select) various harmonics of the voice source by first generating a tight constriction in the oral cavity near the alveolar ridge to generate the focus of F2 and F3, and then modulating the degree of constriction in the uvular region of the upper pharynx to position the focus on a selected harmonic.
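
To experiment with how a change in area function moves the formants, a lossless concatenated-tube calculation can be sketched as follows. This is a textbook chain-matrix model with an ideal open lip end, not the TubeTalker model or the sensitivity-function method used in the study. The constants, the function names, and the uniform 3 cm² test tube are assumptions chosen for illustration.

```python
import math

C_SOUND = 350.0   # assumed speed of sound in warm, moist air (m/s)
RHO = 1.14        # assumed air density (kg/m^3)


def gain_denominator(areas_m2, section_len_m, freq_hz):
    """Return D(f), where U_lips / U_glottis = 1 / D for an ideal open lip end.

    Each section is a lossless uniform tube described by a 2x2 chain matrix;
    sections are ordered from glottis to lips.
    """
    k = 2.0 * math.pi * freq_hz / C_SOUND
    a, b, c, d = 1 + 0j, 0j, 0j, 1 + 0j  # running product, starts as identity
    for area in areas_m2:
        z = RHO * C_SOUND / area  # characteristic impedance of this section
        cs, sn = math.cos(k * section_len_m), math.sin(k * section_len_m)
        m11, m12, m21, m22 = cs, 1j * z * sn, 1j * sn / z, cs
        a, b, c, d = (a * m11 + b * m21, a * m12 + b * m22,
                      c * m11 + d * m21, c * m12 + d * m22)
    return d.real  # D is purely real in the lossless case


def formants(areas_m2, section_len_m, f_max_hz=3000.0, step_hz=1.0):
    """Locate resonances as sign changes of D(f), refined by linear interpolation."""
    found = []
    f_prev = 50.5  # half-integer grid avoids landing exactly on a zero
    d_prev = gain_denominator(areas_m2, section_len_m, f_prev)
    f = f_prev + step_hz
    while f <= f_max_hz:
        d_now = gain_denominator(areas_m2, section_len_m, f)
        if d_prev * d_now < 0.0:
            found.append(f_prev + step_hz * d_prev / (d_prev - d_now))
        f_prev, d_prev = f, d_now
        f += step_hz
    return found


# Sanity check: a uniform 17.5 cm tube (35 sections of 0.5 cm, 3 cm^2 each).
# Quarter-wave theory predicts resonances at odd multiples of c / (4 L) = 500 Hz.
uniform = [3.0e-4] * 35
uniform_formants = formants(uniform, 0.005)
print([round(f) for f in uniform_formants])
```

Reducing the entries of the area list near a chosen distance from the glottis gives a crude way to see how a constriction shifts F2 and F3. Because the sketch neglects losses, lip radiation and wall effects, it says nothing reliable about peak amplitudes such as the 15 dB prominence reported above.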

This proposed mechanism of controlling the spectral focus is supported by observation of vocal tract changes based on dynamic MRI data sets. Using this approach, midsagittal movies of the Tuvan singer were acquired in which each image represented approximately 275 ms. Shown in Figure 5 is a comparison of vocal tract configurations derived with the acoustic-sensitivity algorithm (middle panels) to image frames from an MRI-based movie (upper panels) associated with the points in time indicated by the vertical lines superimposed across the waveform and spectrogram in the lower part of the figure. The image frames were chosen such that they appeared to be representative of the singer placing the spectral focus at 8fo (left) and 12fo (right), respectively, based on the evidence available in the spectrogram. The model-based vocal tract shape in the upper left panel, derived for a spectral focus of 8fo (1200 Hz), exhibits a fairly severe constriction in the uvular region, similar to the constrictive effect that can be seen in the corresponding image frame (middle left). Likewise, the vocal tract shape derived for a spectral focus of 12fo (1800 Hz) (upper right) and the image frame just below it both demonstrate an absence of a uvular constriction. Thus, the model-based approach generated vocal tract shapes that appear to possess characteristics similar to those produced by the singer, and provides support for the proposed mechanism of spectral focus control.

Appendix 1—figure 12

Results of perturbing the baseline area function A0(x) so that F2 and F3 converge on 1800 Hz.

(a) Perturbed area function (thick black line) and the corresponding frequency response; for comparison, the baseline area function is also shown (thin gray line). The frequency response shows the …

Appendix 1—figure 13

Results of perturbing the baseline area function A0(x) so that F2 and F3 converge on 1200, 1350, 1500, 1650, and 1800 Hz.

(A) Perturbed area functions and corresponding frequency responses; line thicknesses and gray scale are matched in the upper and lower panels. (B) Pseudo-midsagittal plot of the perturbed area …

Second focused state

Given that singer T2 was the subject for the MRI scans and uniquely exhibited a second focused state (e.g., Appendix 1—figure 8), the model was also utilized to explore how multiple states could be achieved. Two possibilities appear to be the sharpening of formant F4 alone, or the merging of F4 and F5 (Appendix 1—figure 14). However, it is unclear how reasonable those vocal tract configurations may be and further study is required.

Appendix 1—figure 14

Similar to Figure 5, but additional manipulations were considered to create a second focused state by merging F4 and F5, as exhibited by singer T2 (see middle row in Figure 1). In addition, the spectrogram shown here is from the model (not the singer’s audio). See also Appendix 1—figure 20 for connections back to dynamic MRI data.

Animations and synthesized song

Animations and audio clips demonstrating various quantitative aspects of the model are included in the data files posted to datadryad.org. Specifically they are:

  • Animation (no sound) of vocal tract changes during transition into focused state and subsequent pitch changes – Medley 0 to5 1 cluster.mp4

  • Audio clip of simulated song – Medley 0 to5 1 cluster sim.wav

Instability in focused state

Appendix 1—figure 15 and Appendix 1—figure 16 show that brief transient instabilities in the focused state occur regularly. These lapses arise while the singer is maintaining the focused overtone condition, and thereby provide insight into how focus is actively maintained. One possible explanation is control by virtue of biomechanical feedback, where the focused state can effectively be considered an unstable equilibrium point, akin to balancing a ruler vertically on the palm of your hand. An alternative is that singers learn to create additional quasi-stable equilibrium points (e.g., Appendix 1—figure 17). The sudden transitions observed (Figure 1) could then be likened to two-person cheerleading moves such as a ‘cupie’, in which one person standing on the ground suddenly throws another up vertically and balances them atop their shoulders or upward-stretched hands. A simple proposed model for the transition into the focused state is shown in Appendix 1—figure 17. There, a stable configuration of the vocal tract would be the low point (pink ball). Learning to achieve a focused state would give rise to additional stable equilibria (red ball), which may be more difficult to maintain. Considerations along these lines, combined with a model for biomechanical control (e.g., Sanguineti et al., 1998), can lead to testable predictions specific to when a highly experienced singer is maintaining balance about the transition point into/out of a focused state (e.g., T2_4.wav audio file).

Appendix 1—figure 15

Brief instability in the focused state.

(A) Spectrogram of singer T3 during a period in which the focused state briefly falters (T3_2shortB.wav, extracted from around the 33 s mark of T3_2.wav). (B) Spectral slices taken at two …

Appendix 1—figure 16

Spectrogram of singer T2 (T2_1shortA.wav) about a transition into a focused state. Note that there is a slight instability around 4.5 s.

Appendix 1—figure 17

Schematic illustrating a simple possible mechanical analogy (ball confined to a potential well) for the transition into a focused state.
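
The ball-in-a-well analogy can be made concrete with a toy calculation. The sketch below uses a tilted double-well potential, V(x) = x⁴/4 − x²/2 + 0.1x, whose form and tilt value are hypothetical choices for this example and are not taken from the schematic. It locates the equilibria and classifies each by the sign of the curvature. The deeper well stands in for the ordinary stable configuration and the shallower well for a learned focused state.

```python
def potential(x, tilt=0.1):
    """Tilted double-well potential V(x) = x^4/4 - x^2/2 + tilt*x (hypothetical form)."""
    return x**4 / 4.0 - x**2 / 2.0 + tilt * x


def slope(x, tilt=0.1):
    """First derivative V'(x); equilibria are its roots."""
    return x**3 - x + tilt


def curvature(x):
    """Second derivative V''(x); positive means a stable equilibrium."""
    return 3.0 * x**2 - 1.0


def find_equilibria(tilt=0.1, lo=-2.0, hi=2.0, n=400):
    """Bracket roots of V' on a grid, then refine each bracket by bisection."""
    roots = []
    step = (hi - lo) / n
    for i in range(n):
        a, b = lo + i * step, lo + (i + 1) * step
        if slope(a, tilt) * slope(b, tilt) < 0.0:
            for _ in range(60):
                mid = 0.5 * (a + b)
                if slope(a, tilt) * slope(mid, tilt) <= 0.0:
                    b = mid
                else:
                    a = mid
            roots.append(0.5 * (a + b))
    return roots


equilibria = [(x, "stable" if curvature(x) > 0 else "unstable", potential(x))
              for x in find_equilibria()]
for x, kind, v in equilibria:
    print(f"x = {x:+.3f}  {kind:8s}  V = {v:+.3f}")
```

With this tilt, two stable points are separated by one unstable point, and the left-hand well is the deeper of the two. A small push is enough to escape the shallower well, which is one way to picture why a learned equilibrium might be harder to maintain.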

Additional MRI analysis figures

Volumetric data

An example of the volumetric data (arranged as tiled midsagittal slices) is shown in Appendix 1—figure 18. Note that the NMR artifact resulting from the presence of a dental post is apparently lateralized to one side.

Appendix 1—figure 19 shows a spectrogram of the audio segment (extracted from Run3Vsound.wav) associated with the volumetric scan shown in Appendix 1—figure 18. Segments both with and without the scanner noise are shown.

Vocal tract shape and associated spectrograms

Examples of the vocal tract taken during the dynamic MRI runs (i.e., midsagittal only) are shown for very different representative time points in Appendix 1—figure 20.

Appendix 1—figure 18

Mosaic of single slices from the volumetric MRI scan (Run3) of subject T2 during focused overtone state. Spectrogram of corresponding audio shown in Appendix 1—figure 19.

Appendix 1—figure 19

Spectrogram of steady-state overtone voicing associated with the volumetric scan shown in Appendix 1—figure 18.

Two different one-second segments are shown: the top segment shows images that were made during the scan (and thus includes acoustic noise from the scanner during image acquisition), while the botto…

Appendix 1—figure 20

Representative movie frames and their corresponding spectra for singer T2, as input into modeling parameters (e.g., Figure 5).

The corresponding Appendix data files are DynamicRun2S.mov (MRI images) and DynamicRun2sound.wav (spectra; see also DynamicRun2SGrid.pdf). The top row shows a ‘low pitch’ (first) focused state at …

Data

All data relevant to the study have been placed in the online repository – https://datadryad.org/stash (Bergevin, 2020). Below is a list of the data placed there, along with a brief description (see ‘Materials and methods’ section for additional details).

Acoustic data

All waveforms were obtained at a sample rate of 96 kHz and a bit-depth of 24 bits.

  • T1_1.wav
  • T1_2.wav
  • T1_3.wav
  • T1_3short.wav
  • T2_1.wav
  • T2_1shortA.wav
  • T2_1shortB.wav
  • T2_1shortC.wav
  • T2_2.wav
  • T2_2short.wav
  • T2_3.wav
  • T2_4.wav
  • T2_5.wav
  • T2_5longer.wav
  • T2_5short.wav
  • T3_2.wav
  • T3_2shortA.wav
  • T3_2shortB.wav
  • T4_1.wav
  • T4_1shortA.wav
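
Because the recordings are stored at 24 bits per sample, tools that expect 16-bit or 32-bit samples may not open them directly. The sketch below, which assumes the files are plain PCM WAV and uses only the Python standard library, shows one way to unpack the 3-byte samples into integers. The helper names and the example path are illustrative.

```python
import wave


def decode_pcm24(raw):
    """Convert packed little-endian signed 24-bit PCM bytes to a list of ints."""
    if len(raw) % 3 != 0:
        raise ValueError("byte count is not a multiple of 3")
    return [int.from_bytes(raw[i:i + 3], "little", signed=True)
            for i in range(0, len(raw), 3)]


def read_wav24(path):
    """Return (sample_rate_hz, n_channels, interleaved_samples) for a 24-bit WAV."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 3:
            raise ValueError("expected 3 bytes per sample (24-bit PCM)")
        raw = wf.readframes(wf.getnframes())
        return wf.getframerate(), wf.getnchannels(), decode_pcm24(raw)


# Example call (the path is a placeholder for a downloaded file such as T2_1.wav):
# rate, channels, samples = read_wav24("T2_1.wav")
# The text states that the recordings were made at 96 kHz, so `rate` should match.
```

Samples are returned interleaved when a file has more than one channel. Dividing by 2²³ maps them to the range −1 to 1 if floating-point values are preferred.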

MRI data

* Images: Images were obtained only from singer T2. Note that all image data are saved as DICOM files (i.e., .dcm):

  • Volumetric Run1
  • Volumetric Run2
  • Volumetric Run3
  • Dynamic midsagittal Run1
  • Dynamic midsagittal Run2
  • Dynamic midsagittal Run3

* Audio recordings acquired during MRI acquisition (see ‘Materials and methods’).

  • Vol. Run1 audio
  • Vol. Run2 audio
  • Vol. Run3 audio
  • Dyn. Run1 audio
  • Dyn. Run2 audio
  • Dyn. Run3 audio

* MRI Midsagittal movies with sound were also created by animating the frames in Matlab and syncing the recorded audio via Wondershare Filmora. They are saved as .mov files (Apple QuickTime Movie files):

  • Dyn. Run1 video
  • Dyn. Run2 video
  • Dyn. Run3 video

To facilitate connecting movie frames back to the associated sound produced by singer T2 at that moment, the movies include frame numbers. Those have been labeled on the corresponding time location in the spectrograms (see red labels at top):

  • Dyn. Run1 spectrogram
  • Dyn. Run2 spectrogram
  • Dyn. Run3 spectrogram
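
As a rough aid for relating a frame number to a position in the audio, the sketch below converts a frame number to the time window it covers. It assumes the approximately 275 ms per image quoted for the dynamic scans, contiguous frames, and numbering that starts at 1 at time zero. Those last two points are assumptions, so the labeled spectrograms remain the authoritative mapping.

```python
FRAME_DURATION_S = 0.275  # approximate duration per image quoted in the text


def frame_to_time_window(frame_number, first_frame=1,
                         frame_duration_s=FRAME_DURATION_S):
    """Return (start, end) in seconds covered by a movie frame.

    Assumes frames are contiguous, equally long, and numbered from
    `first_frame` at t = 0 (an assumption; check the labeled spectrograms).
    """
    index = frame_number - first_frame
    if index < 0:
        raise ValueError("frame_number precedes first_frame")
    start = index * frame_duration_s
    return start, start + frame_duration_s


start, end = frame_to_time_window(5)
print(f"frame 5 spans roughly {start:.2f} s to {end:.2f} s")
```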

* Segmented volumetric data files (like those shown in Figure 3), data saved as STL files (i.e., .stl):

  • Segmented data (T2)

Software and synthesized song

Simulations and waveform analysis were implemented in Matlab. The TubeTalker software is provided ‘as is’:

  • Code to analyze general aspects of the waveforms (e.g., Figure 1 spectrograms)
  • Code to quantify eR

References

    1. Adachi S
    2. Yamada M
    (1999) An acoustical study of sound production in biphonic singing, xöömij The Journal of the Acoustical Society of America 105:2920–2932. https://doi.org/10.1121/1.426905
    1. Aksenov AN
    (1973) Tuvin folk music Asian Music 4:7–18. https://doi.org/10.2307/833827
    1. Bergevin C
    (2020) Overtone focusing in biphonic Tuvan throat singing Dryad Digital Repository.
    1. Bernstein JG
    2. Oxenham AJ
    (2003) Pitch discrimination of diotic and dichotic tone complexes: harmonic resolvability or harmonic number? The Journal of the Acoustical Society of America 113:3323–3334. https://doi.org/10.1121/1.1572146
    1. Billig AJ
    2. Davis MH
    3. Deeks JM
    4. Monstrey J
    5. Carlyon RP
    (2013) Lexical influences on auditory streaming Current Biology 23:1585–1589. https://doi.org/10.1016/j.cub.2013.06.042
    1. Bloothooft G
    2. Bringmann E
    3. van Cappellen M
    4. van Luipen JB
    5. Thomassen KP
    (1992) Acoustics and perception of overtone singing The Journal of the Acoustical Society of America 92:1827–1836. https://doi.org/10.1121/1.403839
    1. Bunton K
    2. Story BH
    3. Titze I
    (2013) Estimation of vocal tract area functions in children based on measurement of lip termination area and inverse acoustic mapping Proceedings of Meetings on Acoustics 19:060054. https://doi.org/10.1121/1.4799532
    1. Culling JF
    2. Darwin CJ
    (1993) Perceptual separation of simultaneous vowels: within and across-formant grouping by F0 The Journal of the Acoustical Society of America 93:3454–3467. https://doi.org/10.1121/1.405675
    1. Dang J
    2. Honda K
    (1997) Acoustic characteristics of the piriform Fossa in models and humans The Journal of the Acoustical Society of America 101:456–465. https://doi.org/10.1121/1.417990
    1. Darwin CJ
    (1984) Perceiving vowels in the presence of another sound: constraints on Formant perception The Journal of the Acoustical Society of America 76:1636–1647. https://doi.org/10.1121/1.391610
    1. Darwin CJ
    (2005) Pitch and Auditory Grouping In: C. J Plack, R. R Fay, A. J Oxenham, A. N Popper, editors. Handbook of Auditory Research, 24. Springer. pp. 278–305.
    1. Doolittle EL
    2. Gingras B
    3. Endres DM
    4. Fitch WT
    (2014) Overtone-based pitch selection in hermit thrush song: unexpected convergence with scale construction in human music PNAS 111:16616–16621. https://doi.org/10.1073/pnas.1406023111
    1. Edgerton ME
    2. Bless D
    3. Thibeault S
    4. Fagerholm M
    5. Story B
    (1999) The acoustic analysis of reinforced harmonics The Journal of the Acoustical Society of America 105:1329. https://doi.org/10.1121/1.426220
    1. Edmondson JA
    2. Esling JH
    (2006) The valves of the throat and their functioning in tone, vocal register and stress: laryngoscopic case studies Phonology 23:157–191. https://doi.org/10.1017/S095267570600087X
    1. Fee MS
    2. Shraiman B
    3. Pesaran B
    4. Mitra PP
    (1998) The role of nonlinear dynamics of the syrinx in the vocalizations of a songbird Nature 395:67–71. https://doi.org/10.1038/25725
    1. Fitch WT
    2. Neubauer J
    3. Herzel H
    (2002) Calls out of Chaos: the adaptive significance of nonlinear phenomena in mammalian vocal production Animal Behaviour 63:407–418. https://doi.org/10.1006/anbe.2001.1912
    1. Goldberger AL
    2. Amaral LAN
    3. Hausdorff JM
    4. Ivanov PC
    5. Peng C-K
    6. Stanley HE
    (2002) Fractal dynamics in physiology: alterations with disease and aging PNAS 99:2466–2472. https://doi.org/10.1073/pnas.012579499
    1. Grawunder S
    (2009) On the Physiology of Voice Production in South-Siberian Throat Singing: Analysis of Acoustic and Electrophysiological Evidences Frank & Timme.
    1. Heinz JM
    2. Stevens KN
    (1964) On the derivation of area functions and acoustic spectra from cinéradiographic films of speech The Journal of the Acoustical Society of America 36:1037–1038. https://doi.org/10.1121/1.2143313
    1. Herzel H
    2. Reuter R
    (1996) Biphonation in voice signals, American Institute of Physics 375:644–657. https://doi.org/10.1063/1.51002
    1. Kantz H
    2. Schreiber T
    (2004) Nonlinear Time Series Analysis Cambridge University Press. https://doi.org/10.1017/CBO9780511755798
    1. Kingsley EP
    2. Eliason CM
    3. Riede T
    4. Li Z
    5. Hiscock TW
    6. Farnsworth M
    7. Thomson SL
    8. Goller F
    9. Tabin CJ
    10. Clarke JA
    (2018) Identity and novelty in the avian syrinx PNAS 115:10209–10217. https://doi.org/10.1073/pnas.1804586115
    1. Kob M
    (2004) Analysis and modelling of overtone singing in the sygyt style Applied Acoustics 65:1249–1259. https://doi.org/10.1016/j.apacoust.2004.04.010
    1. Leighton R
    (2000) Tuva or Bust!: Richard Feynman’s Last Journey WW Norton & Company.
    1. Levin TC
    2. Edgerton ME
    (1999) The throat singers of tuva Scientific American 281:80–87. https://doi.org/10.1038/scientificamerican0999-80
    1. Levin TC
    2. Süzükei V
    (2006) Where Rivers and Mountains Sing: Sound, Music, and Nomadism in Tuva and Beyond Indiana University Press.
    1. Li G
    2. Hou Q
    (2017) The physiological basis of chinese höömii generation Journal of Voice 31:e16. https://doi.org/10.1016/j.jvoice.2016.03.007
    1. Lindblom BE
    2. Sundberg JE
    (1971) Acoustical consequences of lip, tongue, jaw, and larynx movement The Journal of the Acoustical Society of America 50:1166–1179. https://doi.org/10.1121/1.1912750
    1. Lindestad PA
    2. Södersten M
    3. Merker B
    4. Granqvist S
    (2001) Voice source characteristics in mongolian “throat singing” studied with high-speed imaging technique, acoustic spectra, and inverse filtering Journal of Voice 15:78–85. https://doi.org/10.1016/S0892-1997(01)00008-X
    1. Mahrt E
    2. Agarwal A
    3. Perkel D
    4. Portfors C
    5. Elemans CP
    (2016) Mice produce ultrasonic vocalizations by intra-laryngeal planar impinging jets Current Biology 26:R880–R881. https://doi.org/10.1016/j.cub.2016.08.032
    1. Mergell P
    2. Herzel H
    (1997) Modelling biphonation — The role of the vocal tract Speech Communication 22:141–154. https://doi.org/10.1016/S0167-6393(97)00016-2
    1. Mermelstein P
    (1973) Articulatory model for the study of speech production The Journal of the Acoustical Society of America 53:1070–1082. https://doi.org/10.1121/1.1913427
    1. Plack CJ
    2. Oxenham AJ
    (2005) The Psychophysics of Pitch In: Pitch. Springer. pp. 7–55. https://doi.org/10.1007/0-387-28958-5_2
    1. Remez RE
    2. Rubin PE
    3. Pisoni DB
    4. Carrell TD
    (1981) Speech perception without traditional speech cues Science 212:947–949. https://doi.org/10.1126/science.7233191
    1. Remez RE
    2. Pardo JS
    3. Piorkowski RL
    4. Rubin PE
    (2001) On the bistability of sine wave analogues of speech Psychological Science 12:24–29. https://doi.org/10.1111/1467-9280.00305
    1. Roberts B
    2. Summers RJ
    3. Bailey PJ
    (2015) Acoustic source characteristics, across-formant integration, and speech intelligibility under competitive conditions Journal of Experimental Psychology: Human Perception and Performance 41:680–691. https://doi.org/10.1037/xhp0000038
    1. Sanguineti V
    2. Laboissière R
    3. Ostry DJ
    (1998) A dynamic biomechanical model for neural control of speech production The Journal of the Acoustical Society of America 103:1615–1627. https://doi.org/10.1121/1.421296
    1. Shamma SA
    2. Elhilali M
    3. Micheyl C
    (2011) Temporal coherence and attention in auditory scene analysis Trends in Neurosciences 34:114–123. https://doi.org/10.1016/j.tins.2010.11.002
    1. Sondhi M
    2. Schroeter J
    (1987) A hybrid time-frequency domain articulatory speech synthesizer IEEE Transactions on Acoustics, Speech, and Signal Processing 35:955–967. https://doi.org/10.1109/TASSP.1987.1165240
    1. Stevens KN
    (2000) Acoustic Phonetics MIT press.
    1. Story BH
    2. Titze IR
    3. Hoffman EA
    (1996) Vocal tract area functions from magnetic resonance imaging The Journal of the Acoustical Society of America 100:537–554. https://doi.org/10.1121/1.415960
    1. Story BH
    2. Laukkanen AM
    3. Titze IR
    (2000) Acoustic impedance of an artificially lengthened and constricted vocal tract Journal of Voice 14:455–469. https://doi.org/10.1016/S0892-1997(00)80003-X
    1. Story BH
    2. Titze IR
    3. Hoffman EA
    (2001) The relationship of vocal tract shape to three voice qualities The Journal of the Acoustical Society of America 109:1651–1667. https://doi.org/10.1121/1.1352085
    1. Story BH
    (2006) Technique for “tuning” vocal tract area functions based on acoustic sensitivity functions The Journal of the Acoustical Society of America 119:715–718. https://doi.org/10.1121/1.2151802
    1. Story BH
    (2007) Time dependence of vocal tract modes during production of vowels and vowel sequences The Journal of the Acoustical Society of America 121:3770–3789. https://doi.org/10.1121/1.2730621
    1. Story BH
    (2013) Phrase-level speech simulation with an airway modulation model of speech production Computer Speech & Language 27:989–1010. https://doi.org/10.1016/j.csl.2012.10.005
    1. Story BH
    (2016) The vocal tract in singing In: The Oxford Handbook of Singing. Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199660773.013.012
    1. Summerfield Q
    2. Culling JF
    (1992) Auditory segregation of competing voices: absence of effects of FM or AM coherence Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 336:357–365. https://doi.org/10.1098/rstb.1992.0069
    1. Suthers RA
    2. Narins PM
    3. Lin WY
    4. Schnitzler HU
    5. Denzinger A
    6. Xu CH
    7. Feng AS
    (2006) Voices of the dead: complex nonlinear vocal signals from the larynx of an ultrasonic frog Journal of Experimental Biology 209:4984–4993. https://doi.org/10.1242/jeb.02594
    1. Theiler J
    2. Eubank S
    3. Longtin A
    4. Galdrikian B
    5. Doyne Farmer J
    (1992) Testing for nonlinearity in time series: the method of surrogate data Physica D: Nonlinear Phenomena 58:77–94. https://doi.org/10.1016/0167-2789(92)90102-S
    1. Titze IR
    2. Horii Y
    3. Scherer RC
    (1987) Some technical considerations in voice perturbation measurements Journal of Speech, Language, and Hearing Research 30:252–260. https://doi.org/10.1044/jshr.3002.252
    1. Titze I
    2. Riede T
    3. Popolo P
    (2008) Nonlinear source–filter coupling in phonation: Vocal exercises The Journal of the Acoustical Society of America 123:1902–1915. https://doi.org/10.1121/1.2832339
    1. Titze IR
    2. Story BH
    (1997) Acoustic interactions of the voice source with the lower vocal tract The Journal of the Acoustical Society of America 101:2234–2243. https://doi.org/10.1121/1.418246
    1. Tokuda I
    2. Riede T
    3. Neubauer J
    4. Owren MJ
    5. Herzel H
    (2002) Nonlinear analysis of irregular animal vocalizations The Journal of the Acoustical Society of America 111:2908–2919. https://doi.org/10.1121/1.1474440
    1. Zollinger SA
    2. Riede T
    3. Suthers RA
    (2008) Two-voice complexity from a single side of the syrinx in northern mockingbird Mimus polyglottos vocalizations Journal of Experimental Biology 211:1978–1991. https://doi.org/10.1242/jeb.014092

Decision letter

  1. Timothy D Griffiths Reviewing Editor; University of Newcastle, United Kingdom
  2. Barbara G Shinn-Cunningham Senior Editor; Carnegie Mellon University, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Tuvan throat singing, in which people are able to simultaneously produce and independently control two different distinct pitches using the human vocal apparatus, has fascinated hearing and speech researchers for decades. This careful study examines the acoustics of the produced sound and offers new insights into why the produced sound results in two distinct, separately controllable pitches.

Decision letter after peer review:

Thank you for submitting your article “Overtone focusing in biphonic Tuvan throat singing” for consideration by eLife. Your article has been reviewed by two peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another, and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary

We enjoyed this work, which addresses the mechanisms by which throat singers produce dual pitches and assesses, based on MRI videos, how the vocal tract is precisely controlled. The work mentions other biological examples of dual fundamentals in songbirds for the broad eLife audience. One of the issues that came up in discussion was the control for normal vocalisations without biphonation, but I think the authors make a reasonable case that the singers act as their own controls. Basically, the work shows that the dual pitch mechanism is associated with changes in the vocal tract morphology based on two constrictions that merge the second and third formants, and with what they call a ‘focussed state’ in which the harmonics at 1.5 kHz to 2 kHz are accentuated. The idea as I understand it is that this accentuates a single harmonic of the fundamental glottal pulse rate so that a new high frequency component of Khoomei emerges that is in effect perceptually ‘released’ from the harmonic series to allow the emergence of the high pitched whistling part of the song.

Major comments

1) From first principles, dual pitch singing could be achieved by a different type of glottal pulse generation in the larynx so that two vibration modes were present (as in avian syrinx). This is not the mechanism suggested here, and it is hard to see how the anatomy and physiology of the human larynx might allow this, but this has not been directly examined in the MRI work. The authors carried out a careful acoustic analysis which shows only one harmonic series before and after transitions to throat singing (without shifting), which I think is adequate. But they might comment on the other possible mechanism for biological readers, if only to dismiss it.

2) Both reviewers thought the discussion of the basis for the perceived dual pitch was not clear. The authors discuss differences in cochlear mechanisms between low frequency regions and high frequency regions. More effort could be made to explain how the dual pitch, which is attributed to a type of spectral emphasis, can be reconciled with current models of pitch perception. The fundamental for the singers assessed was ~150 Hz so that the >1.5 kHz region will be unresolved (H10 and above). The greatest contribution to the salience of the low pitch will be the resolved harmonics at frequencies below the focus region, which are well represented. The high-frequency harmonics will usually contribute (weakly) to the low pitch based on the temporal firing patterns due to merged harmonics in frequency bands. The authors appear to be arguing that a different spectral pitch emerges in the high frequency focussed region, distinct from that associated with the lower harmonics.

3) The argument about decreased phase locking at high frequency was not convincing: this occurs in a much higher frequency region than the focussed region. The argument that the high pitch was not easily explained by a non-linear distortion was convincing.

4) In conclusion, we thought the work nicely shows the changes in vocal tract morphology and associated spectrum as an explanation for the dual pitch, but more teasing out of mechanism for the dual pitch perception is required in a way that might be accessible to readers.
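
The resolvability argument in comment 2 can be illustrated with a rough calculation. The sketch below uses the Glasberg and Moore approximation for the equivalent rectangular bandwidth of the auditory filter, ERB(f) = 24.7(4.37f/1000 + 1) Hz, and compares it with the 150 Hz spacing between harmonics. Treating "spacing wider than one ERB" as the boundary between resolved and unresolved harmonics is a simplifying assumption made for this example and is not a criterion stated by the reviewers.

```python
def erb_hz(freq_hz):
    """Equivalent rectangular bandwidth (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)


def spacing_to_erb_ratio(fo_hz, harmonic_number):
    """Harmonic spacing (fo) divided by the ERB at that harmonic's frequency."""
    return fo_hz / erb_hz(harmonic_number * fo_hz)


FO_HZ = 150.0  # approximate fundamental quoted in the decision letter
ratios = {n: spacing_to_erb_ratio(FO_HZ, n) for n in range(1, 14)}
for n, r in ratios.items():
    label = "wider than one ERB" if r > 1.0 else "narrower than one ERB"
    print(f"H{n:<2d} {n * FO_HZ:6.0f} Hz  spacing/ERB = {r:.2f}  ({label})")
```

Under this crude criterion the low harmonics are spaced well apart relative to the filter bandwidth, whereas harmonics in the 1.5 to 2 kHz focus region fall within a single filter, which is broadly in line with the reviewers' description of that region as unresolved.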

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article “Overtone focusing in biphonic Tuvan throat singing” for consideration by eLife. Your article has been reviewed by a member of our Board of Reviewing Editors and Barbara Shinn-Cunningham as the Senior Editor.

We are afraid we are still not satisfied with the discussion of the basis for the dual pitch at the end of the Discussion. The authors have demonstrated a region of spectral focus as a proposed mechanism for the new pitch. But we still do not understand why the focussed overtones produce a different pitch. They are still harmonics of the same fundamental and interactions between them in this unresolved region would be expected to produce beating at the same frequency as the fundamental, in the absence of non-linear mechanisms. We also do not understand what the additional relevance of a decrease in phase locking in this region would be that the authors highlight. Are the authors claiming that the focused region produces spectral excitation in a region without the usual coding of beating between harmonics (because of decreased phase locking) and that this is the cause of the new pitch? If so an explicit suggestion along those lines might help readers who are familiar with conventional pitch models.

eLife does not usually encourage multiple rounds of revision but this is a critical point in the interpretation of an interesting study, and I would encourage a revision with a much shorter final section of Discussion that explains a clear hypothesis related to the cause of the new pitch.

https://doi.org/10.7554/eLife.50476.sa1

Author response

Major comments

1) From first principles, dual pitch singing could be achieved by a different type of glottal pulse generation in the larynx so that two vibration modes were present (as in avian syrinx). This is not the mechanism suggested here, and it is hard to see how the anatomy and physiology of the human larynx might allow this, but this has not been directly examined in the MRI work. The authors carried out a careful acoustic analysis which shows only one harmonic series before and after transitions to throat singing (without shifting), which I think is adequate. But they might comment on the other possible mechanism for biological readers, if only to dismiss it.

We attempted to further clarify the point (that we saw no evidence for a nonlinear source mechanism) and added an additional line of text as per the suggestion.

2) Both reviewers thought the discussion of the basis for the perceived dual pitch was not clear. The authors discuss differences in cochlear mechanisms between low frequency regions and high frequency regions. More effort could be made to explain how the dual pitch, which is attributed to a type of spectral emphasis, can be reconciled with current models of pitch perception. The fundamental for the singers assessed was ~150 Hz so that the >1.5 kHz region will be unresolved (H10 and above). The greatest contribution to the salience of the low pitch will be the resolved harmonics at frequencies below the focus region, which are well represented. The high-frequency harmonics will usually contribute (weakly) to the low pitch based on the temporal firing patterns due to merged harmonics in frequency bands. The authors appear to be arguing that a different spectral pitch emerges in the high frequency focussed region, distinct from that associated with the lower harmonics.

This criticism was given particularly serious thought and consideration. As a result, we totally rewrote this section to make the proposed ideas clearer, as well as accessible to a broad readership. We tried to find a better balance between issues/questions related to pitch coding and those to cochlear mechanics.

3) The argument about decreased phase locking at high frequency was not convincing: this occurs in a much higher frequency region than the focussed region. The argument that the high pitch was not easily explained by a non-linear distortion was convincing.

As alluded to in the comments above, we clarified the nature of the argument (re phase locking) by expanding upon the discussion of pitch coding. While a reasonable degree of phase locking would still be expected around the 1.5-2 kHz region, this is also where temporal coding starts to fall off dramatically (e.g., Verschooten et al., 2018, PLoS Biol.). That facet, that in the 1-2 kHz region of the human cochlea the fidelity of timing information changes, is what is relevant to the narrative thread here.

4) In conclusion, we thought the work nicely shows the changes in vocal tract morphology and associated spectrum as an explanation for the dual pitch, but more teasing out of mechanism for the dual pitch perception is required in a way that might be accessible to readers.

See comments above.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

[…]

eLife does not usually encourage multiple rounds of revision but this is a critical point in the interpretation of an interesting study, and I would encourage a revision with a much shorter final section of Discussion that explains a clear hypothesis related to the cause of the new pitch.

As we would like to see this work published with eLife, we have drastically truncated the highlighted section to create “a much shorter final section” as suggested. Given our lack of expertise in pitch perception, coupled with our appreciation for the comments raised, we instead (succinctly) reframed through the lens of looking ahead at future work. Specifically, we include only what we think are some quite interesting and provocative parallels we have observed between Sygyt song and cochlear mechanics. We believe that providing this as a summary to the narrative will help stimulate crosstalk between emerging viewpoints in cochlear mechanics and central processing (e.g., pitch perception).

As such, hopefully we have a more streamlined “story” that will be sufficient for publication. We believe the rest of the work paints a clear picture as to how the morphology leads to biphonation and that can stand on its own without over speculation on other facets.

https://doi.org/10.7554/eLife.50476.sa2

Article and author information

Author details

Christopher Bergevin

  1. Physics and Astronomy, York University, Toronto, Canada
  2. Centre for Vision Research, York University, Toronto, Canada
  3. Fields Institute for Research in Mathematical Sciences, Toronto, Canada
  4. Kavli Institute of Theoretical Physics, University of California, Santa Barbara, United States

Contribution

Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology

For correspondence

cberge@yorku.ca

Competing interests

No competing interests declared

ORCID

0000-0002-4529-399X

Chandan Narayan

Languages, Literatures and Linguistics, York University, Toronto, Canada

Contribution

Conceptualization, Investigation

Competing interests

No competing interests declared

Joy Williams

York MRI Facility, York University, Toronto, Canada

Contribution

Investigation, Methodology

Competing interests

No competing interests declared

Natasha Mhatre

Biology, Western University, London, Canada

Contribution

Investigation, Visualization, Methodology

Competing interests

No competing interests declared

ORCID

0000-0002-3618-306X

Jennifer KE Steeves

  1. Centre for Vision Research, York University, Toronto, Canada
  2. Psychology, York University, Toronto, Canada

Contribution

Investigation, Methodology

Competing interests

No competing interests declared

ORCID

0000-0002-7487-4646

Joshua GW Bernstein

National Military Audiology & Speech Pathology Center, Walter Reed National Military Medical Center, Bethesda, United States

Contribution

Investigation, Writing – original draft

Competing interests

No competing interests declared

Brad Story

Speech, Language, and Hearing Sciences, University of Arizona, Tucson, United States

Contribution

Software, Formal analysis, Investigation, Visualization, Methodology

For correspondence

bstory@email.arizona.edu

Competing interests

No competing interests declared

ORCID

0000-0002-6530-8781

Funding

Natural Sciences and Engineering Research Council of Canada (RGPIN-430761-2013)

  • Christopher Bergevin

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

A heartfelt thank you to Huun Huur Tu, without whom this study would not have been possible. Input/suggestions from Ralf Schlueter, Greg Huber, Dorothea Kolossa, Chris Rozell, Tuomas Virtanen, and the reviewers are gratefully acknowledged. Support from York University, the Fields Institute for Research in Mathematical Sciences, and the Kavli Institute of Theoretical Physics is also gratefully acknowledged. CB was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Grant RGPIN-430761–2013. The identification of specific products or scientific instrumentation does not constitute endorsement or implied endorsement on the part of the author, Department of Defense, or any component agency. The views expressed in this article are those of the authors and do not reflect the official policy of the Department of Army/Navy/Air Force, the Department of Defense, or the U.S. Government.

Ethics

Human subjects: Data were collected with approval of the York University Institutional Review Board (IRB protocol to Prof Jennifer Steeves). This study was approved by the Human Participants Review Board of the Office of Research Ethics at York University (certificate #2017-132) and adhered to the tenets of the Declaration of Helsinki. All participants gave informed written consent and consent to publish prior to their inclusion in the study.

Senior Editor

  1. Barbara G Shinn-Cunningham, Carnegie Mellon University, United States

Reviewing Editor

  1. Timothy D Griffiths, University of Newcastle, United Kingdom

Publication history

  1. Received: July 24, 2019
  2. Accepted: January 31, 2020
  3. Accepted Manuscript published: February 12, 2020 (version 1)
  4. Accepted Manuscript updated: February 17, 2020 (version 2)
  5. Version of Record published: March 10, 2020 (version 3)

Copyright

© 2020, Bergevin et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

https://elifesciences.org/articles/50476

Tisato G., Ricci Maccarini A., Tran Quang Hai (2001), “Caratteristiche fisiologiche e acustiche del canto difonico”

Trân Quang Hai & Graziano Tisato in Venice, 2004

Graziano Tisato & Trân Quang Hai in Padova, 13 October 2017

Dr. Andrea Ricci Maccarini

Click the link below to read the full article, illustrated with spectral and acoustical analyses:

Tisato G., Ricci Maccarini A., Tran Quang Hai (2001), “Caratteristiche fisiologiche e acustiche del canto difonico”

II Convegno Internazionale di Foniatria (2nd International Phoniatrics Conference), Ravenna, 19 October 2001

Physiological and acoustic characteristics of overtone singing (Canto Difonico)

Graziano G. Tisato, Andrea Ricci Maccarini, Tran Quang Hai

Introduction
Overtone singing (Canto Difonico, also called harmonic singing) is a singing technique that is fascinating from a musical point of view and particularly interesting from a scientific one. With this technique the vocal sound splits into two distinct sounds: the lower one corresponds to the normal voice, in the singer's usual register, while the higher one is a flute-like sound corresponding to one of the harmonic partials, in a high (or very high) register. Depending on the pitch of the fundamental, the style, and the singer's skill, the perceived harmonic can range from the 2nd to the 18th (and even beyond).
In the scientific literature, overtone singing first appears in a memoir presented by Manuel Garcia to the Academy of Sciences in Paris on 16 November 1840, concerning the diphonic singing he heard from Bashkir singers in the Urals (Garcia, 1847).
In a treatise on acoustics published a few decades later (Radau, 1880), the reality of this type of singing is called into question: “…One must class among miracles what Garcia recounts of the Russian peasants from whom he supposedly heard one melody sung in chest voice and another in head voice at the same time”.
Almost a century had to pass after 1840 before Garcia's report was objectively confirmed, by the recordings that Russian ethnologists made among the Tuva in 1934. Faced with the evidence of the analysis that Aksenov carried out on those recordings in 1964, researchers began to take the problem of overtone singing seriously (Aksenov, 1964, 1967, 1973). Aksenov was the first to explain the phenomenon as the selective filtering of the glottal sound by the formant envelope of the vocal tract, and to compare it to the jaw harp (with the difference that the lamella of that instrument can obviously produce only a fixed fundamental). In the same period an article also appeared in the Journal of the Acoustical Society of America (JASA) on diphonic sound in the chanting of certain Tibetan Buddhist sects, in which the authors correctly interpret the action of the formants on the glottal source, without, however, managing to explain how the monks can produce such low fundamentals (Smith et al., 1967).
Starting in 1969, Leipp and the Musical Acoustics Group (GAM) of Université Paris VI took an interest in the phenomenon from the acoustic point of view (Leipp, 1971). In that period Tran Quang Hai, of the Musée de l’Homme in Paris, undertook a series of systematic studies that led to the discovery of overtone singing in an unsuspected number of different cultural traditions (Tran Quang, 1975, 1980, 1989, 1991a, 1991b, 1995, 1998, 1999, 2000, and the website http://www.baotram.ovh.org). The distinctive feature of Tran Quang Hai's research is that he tested and verified the various singing techniques and styles on his own voice, which allowed him to develop easy learning methods (Tran Quang, 1989). In 1989 Tisato analysed and synthesised overtone singing with an LPC model, demonstrating in this way that the perception of the harmonics depends exclusively on the resonances of the vocal tract (Tisato, 1989a, 1991). In the same year, an endoscopic examination of Tran Quang Hai's vocal folds also confirmed that the laryngeal vibration was normal (Sauvage, 1989; Pailler, 1989).
In 1992 a more in-depth phonetic and perceptual study appeared, highlighting the role of nasalisation in the perception of the two sounds, the presence of very strong adduction of the vocal folds, and their prolonged closure (Bloothooft et al., 1992). The authors dispute Dmitriev's hypothesis that overtone singing is a diplophonia, with two sound sources produced by the true and the false vocal folds (Dmitriev et al., 1983). In 1999 Levin published an article on the Scientific American website that is particularly interesting for its audible musical examples, its filmed X-ray images of the position of the articulators and the tongue, and its explanation of the production techniques of the various styles of overtone singing (Levin et al., 1999, http://www.sciam.com/1999/0999issue/0999levin.html).
The work presented here is the result of a recent working session with Tran Quang Hai (October 2001), in which we examined the production mechanisms of overtone singing by fibre-optic endoscopy. The instrumentation consisted of a flexible fibre-optic endoscope connected to a stroboscopic light source, to assess what was happening at the level of the pharynx and larynx, and a rigid 0° optic connected to a halogen light source, to examine the oral cavity.

Gerrit Bloothooft, Eldrid Bringmann, Marieke van Cappellen, Jolanda B. van Luipen, and Koen P. Thomassen : A phonetic study of overtone singing


Abstract

We describe the phenomenon of overtone singing in terms of the classical theory of speech production. The overtone sound stems from the second formant or a combination of both the second and third formants, as the result of careful, rounded articulation from //, via schwa /ə/ to /y/ and /i/. Strong nasalisation provides, at least for the lower overtones, an acoustic separation between the second and first formants, and can also reduce the amplitude of the first formant. The bandwidth of the overtone peak is remarkably small and suggests a firm and relatively long closure of the glottis during overtone phonation. Perception experiments showed that listeners categorize the overtone sounds differently from normally sung vowels.


Research Institute for Language and Speech, University of Utrecht
Trans 10, 3512 JK Utrecht, The Netherlands

 


1. Introduction

Overtone singing is a special type of voice production resulting in a very pronounced, high and separate tone which can be heard over a more or less constant base sound. The technique is rarely used in Western music but in Asia (especially Mongolia and Tibet) it is more common and overtone singing can be heard during secular and religious festivities. The high tone follows a characteristic musical scale [for instance, for pitch C3 (130.8 Hz) (- and + indicate a deviation from the exact tone): C3, C4, G4, C5, E5-, G5, A5+, C6, D6, E6-, F6+, G6, G#6+, A6+, B6-, C7,… ], from which it can be concluded that one really hears an overtone of the fundamental.
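The scale quoted above can be checked with a short calculation. The following is a minimal sketch, assuming equal temperament with A4 = 440 Hz (an assumption; the paper does not state its tuning reference). The helper names `nearest_note` and `harmonic_scale` are hypothetical. It lists each harmonic of a 130.8 Hz fundamental with its nearest note and the deviation in cents.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz, a4_hz=440.0):
    """Return (note name, deviation in cents) for the nearest equal-tempered note.

    Assumes 12-tone equal temperament tuned to a4_hz (an assumption of this sketch).
    """
    # Fractional semitone number on the MIDI scale, where 69 is A4.
    semitones = 69 + 12 * math.log2(freq_hz / a4_hz)
    nearest = round(semitones)
    cents = 100 * (semitones - nearest)
    name = NOTE_NAMES[nearest % 12] + str(nearest // 12 - 1)
    return name, cents

def harmonic_scale(f0_hz, n_harmonics):
    """List (harmonic number, frequency in Hz, nearest note, cents deviation)."""
    return [(k, k * f0_hz) + nearest_note(k * f0_hz)
            for k in range(1, n_harmonics + 1)]

for k, freq, name, cents in harmonic_scale(130.8, 16):
    print(f"{k:2d}  {freq:7.1f} Hz  {name:<4s} {cents:+6.1f} cents")
```

Nearest-note rounding may name a harmonic differently from the list above when it lies far from any tempered note: the 7th and 14th harmonics, for example, come out as a flat A# rather than a sharp A. The frequencies are the same; only the choice of neighbouring note name differs.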

The literature contains only a few reports on overtone singing [1,5,7,8], which indicate both the importance of formants and register type. In this paper we present both an acoustic analysis of overtone singing and a study to evaluate the perception of the overtone sounds, in relation to normally sung vowels.


2. Material

We have recorded series of sung overtones from a singer with many years of experience in overtone singing, both as a performer and as a teacher. In this paper we describe the results for an Fo value of 138 Hz (C#3). In addition, 12 Dutch vowels /a/, /a/, //, /o/, /e/, //, //, /i/, /oe/, //, /u/, and /y/, sung in a normal way at the same Fo, were recorded.


3. Acoustic analysis

The recordings were digitized at a rate of 10 kHz and stored in a computer. From the middle, stable, part of each recording 300 ms was segmented. Average power spectra were obtained from FFT analyses (1024 points, shift 6.4 ms) over this segment. Formant frequencies were computed on the basis of appropriate LPC or ARMA analysis.
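The analysis settings above fix the frame layout and the spectral resolution. The sketch below works these out, assuming that only complete 1024-sample frames are used and that no zero-padding is applied (the paper does not say either way). The function name `frame_layout` is hypothetical.

```python
def frame_layout(fs_hz, n_fft, shift_s, segment_s):
    """Derive framing figures for a short-time FFT analysis.

    Assumes complete frames only and no zero-padding (assumptions of this sketch).
    """
    hop = round(shift_s * fs_hz)            # samples between frame starts
    n_samples = round(segment_s * fs_hz)    # samples in the analysed segment
    n_frames = 1 + (n_samples - n_fft) // hop if n_samples >= n_fft else 0
    return {
        "hop": hop,
        "n_samples": n_samples,
        "n_frames": n_frames,
        "frame_ms": 1000.0 * n_fft / fs_hz,  # duration of one frame
        "bin_hz": fs_hz / n_fft,             # spacing between FFT bins
    }

layout = frame_layout(fs_hz=10_000, n_fft=1024, shift_s=0.0064, segment_s=0.3)
print(layout)

# FFT bin closest to the 8th partial of Fo = 138 Hz.
print(round(8 * 138.0 / layout["bin_hz"]))
```

With a bin spacing just under 10 Hz, neighbouring harmonics of a 138 Hz fundamental fall about 14 bins apart, which is consistent with the individual harmonics being resolved in the averaged spectra.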


3.1. FFT-Spectra

Figure 1 shows the average FFT spectra of all overtone recordings. Despite the averaging procedure, the width of each individual harmonic is limited, indicating the stability of Fo over the interval (standard deviation of Fo was less than 0.1 semitone in all cases). It can be seen from the shifting peak in the spectra that overtone singing seems interpretable as a special use of a formant. Obviously, the singer tries to match a formant with the intended overtone frequency and succeeds very well.

FIG. 1. Average FFT spectra for overtone sounds, sung at Fo = 138 Hz (C#3), plotted against frequency (kHz). The overtone sounds are numbered according to the main partial involved.

3.2. Formant frequency analysis

In Fig. 2 we present formant frequency results for both the overtone sounds and the sung vowels in the F1 – F2 plane. The figure shows two modes in the production: firstly, the overtone sounds 4-6 around /u/, and secondly, the track from // to /i/.

In the first mode, it can be seen from the FFT-spectra that there is energy absorption around 400 Hz, indicating a strong nasalisation. The characteristic overtone sound resides in the second formant, as others [1,8] had already suggested. The bandwidth of the second formant is very narrow and, especially for the lower overtones, seldom exceeds 40 Hz. This indicates little acoustic damping in production: firm glottal closure and small losses in the vocal tract. All these characteristics indicate a low, rounded, nasalised, back vowel /u/ or // (low F1 and F2, a nasal pole/zero pair, and suppressed F3 [3]).
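The link between a narrow bandwidth and low damping can be made concrete. The sketch below assumes the formant behaves as a simple second-order resonance, whose amplitude envelope decays as exp(-πBt) for a -3 dB bandwidth B. That model, the function names, and the choice of the 5th partial as an example are assumptions for illustration, not values from the paper.

```python
import math

def quality_factor(center_hz, bandwidth_hz):
    """Q of a resonance: centre frequency divided by its -3 dB bandwidth."""
    return center_hz / bandwidth_hz

def decay_time_constant_s(bandwidth_hz):
    """Time for the envelope exp(-pi * B * t) to fall by a factor of e."""
    return 1.0 / (math.pi * bandwidth_hz)

f2_hz = 5 * 138.0   # hypothetical example: F2 tuned to the 5th partial of Fo = 138 Hz
print(quality_factor(f2_hz, 40.0))           # Q for a 40 Hz bandwidth
print(1000 * decay_time_constant_s(40.0))    # envelope time constant in ms
print(1000 / 138.0)                          # one glottal period in ms, for comparison
```

Under these assumptions a 40 Hz bandwidth gives a time constant of roughly 8 ms, slightly longer than one glottal period at 138 Hz, so the resonance is still ringing when the next glottal pulse arrives.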

The second mode in the production of an overtone sound applies to overtone frequencies higher than 800 Hz. The main peak of the spectrum still rises in tune with the intended overtone frequency and is interpreted as a combination of F2 and F3. It may be of interest that the singer explains this series of overtones with the articulatory variation during the word ‘worry’. It is known, already from the Peterson and Barney data, that in a retroflex /r/ the F3 frequency can be remarkably low and can approach the F2 frequency. This has also been mentioned by Stevens (1989), especially in combination with lip rounding, while Sundberg (1987) mentioned the effect as the acoustic result of a larger cavity directly behind the front teeth.

For the higher overtone sounds, the articulation comes near /y/ and /i/, where continued lip rounding makes it possible to bring F2 and F3 together [4], although for the highest overtones a subtle lip spread may be needed to reduce the front cavity to a minimum.

 

FIG. 2. F1 – F2 plane for stimuli sung at Fo = 138 Hz, with positions of the vowels (IPA symbols) and overtone sounds (represented by the number of the corresponding partial).


3.3. The glottal factor

The very narrow bandwidth of the “overtone formant” suggests a good and long glottal closure. We believe that the singer used modal register, with a relatively long glottal closure, originating from a firm glottal adduction. This hypothesis does not exclude that performers may use the vocal fry register as well [7]. In all cases, the long glottal closure requires a strong adduction of the vocal folds, which could easily result in general muscular hypertension in the pharyngeal region. This may relate to the prominent role of the buccal cavity, suggested by Hai (1991).


3.4. Intensity analysis

Up to an overtone frequency of 1.5 kHz, the overtone harmonic has a stable relative intensity of -10 dB relative to overall SPL, and dominates the spectrum. For higher frequencies, the relative level of the overtone harmonic sharply drops with a slope of about -18 dB/octave.
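The two regimes described above can be summarised as an idealised piecewise model. This is a sketch only: the sharp corner at 1.5 kHz and the function name `overtone_relative_level_db` are assumptions, and the measured data will not follow the lines exactly.

```python
import math

def overtone_relative_level_db(freq_hz, corner_hz=1500.0,
                               plateau_db=-10.0, slope_db_per_octave=-18.0):
    """Idealised level of the overtone harmonic relative to overall SPL.

    Flat at plateau_db up to corner_hz, then falling by slope_db_per_octave
    for every doubling of frequency above the corner.
    """
    if freq_hz <= corner_hz:
        return plateau_db
    octaves_above = math.log2(freq_hz / corner_hz)
    return plateau_db + slope_db_per_octave * octaves_above

for f in (690.0, 1500.0, 2000.0, 3000.0):
    print(f, round(overtone_relative_level_db(f), 1))
```

One octave above the corner, at 3 kHz, the model puts the overtone harmonic 28 dB below the overall level.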


4. The perception of overtone singing

4.1. Material, listening experiment, and analysis

As stimuli we used the combined set of 14 overtone sounds and 12 Dutch vowels. From these stimuli we used the same segment (300 ms) as had been used for the acoustical analyses, but we shaped the first and final 25 ms sinusoidally to avoid the perception of clicks. In a computer-controlled experiment, these stimuli were judged by fifteen listeners on ten 7-point bipolar semantic scales. Further details of semantic scales will be presented in a forthcoming paper. The judgements were analyzed by means of multidimensional preference analysis MDPREF [2]. In the technique of MDPREF a stimulus space is constructed in which distance corresponds to perceptual (dis)similarity.
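The onset and offset shaping can be sketched as follows. The paper says only that the first and final 25 ms were shaped sinusoidally. The quarter-sine ramp used here is one plausible reading and is an assumption, as is the function name `sinusoidal_envelope`. The 10 kHz sampling rate is taken from Section 3.

```python
import math

def sinusoidal_envelope(n_samples, ramp_samples):
    """Gain envelope with quarter-sine fade-in and fade-out, flat at 1.0 between.

    The exact ramp shape is an assumption of this sketch.
    """
    if 2 * ramp_samples > n_samples:
        raise ValueError("ramps are longer than the segment")
    envelope = [1.0] * n_samples
    for i in range(ramp_samples):
        gain = math.sin(0.5 * math.pi * i / ramp_samples)  # rises from 0 towards 1
        envelope[i] = gain                      # fade-in
        envelope[n_samples - 1 - i] = gain      # mirrored fade-out
    return envelope

fs_hz = 10_000
envelope = sinusoidal_envelope(n_samples=round(0.300 * fs_hz),
                               ramp_samples=round(0.025 * fs_hz))
# Multiply each stimulus sample by the matching envelope value before playback.
print(len(envelope), envelope[0], envelope[len(envelope) // 2], envelope[-1])
```

Starting and ending the segment at zero gain removes the abrupt step that would otherwise be heard as a click.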


4.2. The perceptual stimulus space

The plane of the first two dimensions of the stimulus space is shown in Fig. 3. 41 % of the total variation in the judgements was explained in this plane, while higher dimensions each explained less than 6.3 %.

 

FIG. 3. The perceptual stimulus space. The overtone sounds are given by the number of their corresponding partial, the vowels by their IPA symbol.

The overtone sounds and normally sung vowels are perceptually separated clusters. The vowels are situated roughly in a triangle, with the cardinal vowels /i/, /u/, and /a/ at the angles. The overtone sounds are roughly ordered according to their harmonic number, although the stimuli numbered from 4 to 10 can be described as a cluster. This probably relates to the constant relative energy of the overtone harmonic for this set. The direction of the overtone sounds is, from the lower to the higher numbers, about the same as from /u/ to /i/, as may be expected from the relation between harmonic numbers and F2 frequency values.


4.3. A physical description of the perceptual stimulus space

We attempted to match the perceptual stimulus space with multidimensional physical descriptions of the stimuli [formant frequency space (see Fig. 2), 1/3-octave bandfilter energy space both by means of the Plomp metric and the Klatt metric [2,6]]. These attempts were not successful (low correlations between coordinate values along dimensions) because of the division into two clusters of the stimulus space, for which these metrics do not present an explanation. Some additional perceptual sensitivity to the very small bandwidth of the “overtone formant”, which clearly physically separates overtone sounds and normally sung vowels, seems necessary to explain the results.


5. REFERENCES

[1] Barnett, B.M. (1977), “Aspects of vocal multiphonics”, Interface 6, 117-149.
[2] Bloothooft, G. and Plomp, R. (1988), “The timbre of sung vowels”, JASA 84, 847-860.
[3] Fant, G. (1960), “Acoustic theory of speech production”, The Hague: Mouton.
[4] Fujimura, O., and Lindqvist, J. (1971), “Sweep-tone measurements of vocal tract characteristics”, JASA 49, 541-558.
[5] Hai, T.Q. (1991), “New experiments about the Overtone Singing Style”, Proc. Conference ‘New ways of the voice’, Besançon, 61.
[6] Klatt, D.H. (1982), “Prediction of perceived phonetic distance from critical-band spectra: a first step”, Proc. ICASSP, Paris, 1278-1281.
[7] Large, J. and Murry, T. (1981), “Observations on the nature of Tibetan chant”, J. of Exp. Research in Singing 5, 22-28.
[8] Smith, H., Stevens, K.N., and Tomlinson, R.S. (1967), “On an unusual mode of chanting by certain Tibetan lamas”, JASA 41, 1262-1264.
[9] Stevens, K.N. (1989), “On the quantal nature of speech”, J. of Phonetics 17, 3-45.
[10] Sundberg, J. (1987), “The science of the singing voice”, DeKalb: Northern Illinois University Press.

 http://www.let.uu.nl/~Gerrit.Bloothooft/personal/Publications/ICPhS1991overtones.htm