Semi-Automatic Prosodic Transcription of Spoken Spanish in XML

Eduardo Velázquez

Ph.D. Student, Freie Universität Berlin
CONACYT Scholar (Consejo Nacional de Ciencia y Tecnología, Mexico)

utka@yahoo.com

Abstract

XML (Extensible Markup Language) is designed to represent hierarchical structures; here it is used to represent the structure of the prosodic components of spoken language. The XML-based transcription system proposed here allows the input of 1) the phonetic parameters of F0, intensity and duration of each syllable, together with their relative variation and standardized values, to facilitate discrimination and comparison; 2) the distribution of feet; 3) the boundaries and characterization of intonation units and utterances, and 4) other conversational phenomena such as pauses, overlaps and interruptions. This markup language is currently being used as an analysis tool for a corpus of digitally-recorded conversations in the Mexican and Iberian vernaculars of spoken Spanish.

1. Introduction

There are four characteristics which distinguish XML [3] from other markup languages (e.g. HTML):
a) Its extensibility: it does not contain a fixed set of tags.
b) Its emphasis on descriptive rather than procedural markup: descriptive markup allows the same document to be processed in a variety of ways by means of style sheets, which may use only the parts considered relevant or process the same part of the document for different purposes. XML focuses on the meaning of data, not on its presentation.
c) Its document type concept: documents are considered to conform to different types, and every document type is formally defined by its constituent parts and their structure. Therefore, documents must be well-formed according to a defined syntax, and may be formally validated.
d) Its independence of any one hardware or software system: every XML document, regardless of the language and writing system employed, uses the same method to encode characters as binary data. This encoding is defined by an international standard known as Unicode, which provides a character set covering most of the past and present writing systems of the world [12].
In XML, constituents are called elements. Each element represents one of the document's logical components; a document must describe the logical role of its elements, i.e. the abstraction they represent. Elements may contain sub-elements and text, also called character data. Moreover, elements may be further specified and described by means of attributes.
There is also another feature in XML which helps to manage the size and complexity of documents: the external entity. Through the use of external entities, a document may be split into, and keep track of, the separate physical pieces that compose it [7].
Finally, markup is the medium used to represent the document’s logical structure and the way all physical entities are linked. The general syntax rules of XML markup are given below:
• Markup is distinguished from character data by special characters called delimiters: everything between "<" and ">" (a tag) or between "&" and ";" (an entity reference) is markup [7].
• Tag names are case sensitive. Therefore, <TAG>, <Tag> and <TaG> are interpreted as three different tags [7], [12].
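To illustrate these rules, here is a minimal well-formed document; the element names <letter> and <greeting>, the attribute lang and the contents are invented purely for the example:

  <?xml version="1.0" encoding="UTF-8"?>
  <letter>
    <!-- <greeting> is an element; lang is one of its attributes;
         "Hello" is its character data -->
    <greeting lang="en">Hello</greeting>
    <greeting lang="es">Hola</greeting>
  </letter>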

1.1. Marking up human language

The idea of representing human language with a markup language is not new. Some specialist teams, consisting for the most part of computer scientists, are dedicated to making semantic web browsing possible (e.g. [2]), based on the conceptual classification of information, while others are working on speech synthesis (Speech Synthesis Markup Language, SSML [4]) and voice recognition (Voice Extensible Markup Language, VoiceXML [8]). Other applications are being developed and used by language scientists, e.g. EXMARaLDA [9], [10] and the TEI [12], whose principles are related to some extent to this proposal. However, none of those systems was specifically developed to represent the prosodic structure of language.

2. Structure of prosodic transcripts

This section describes the constituents of a transcript containing prosodic information and the hierarchical structure in which those elements are embedded (see Fig. 1). The root element of this proposal is called <Transcript>. It has one optional sub-element, <Header> (see 2.1.), and one required sub-element, <Text> (see 2.2.). Their sub-elements and respective attributes are explained below.
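Expressed in markup, the skeleton of such a transcript therefore looks as follows (contents omitted):

  <Transcript>
    <Header>
      <!-- optional: metadata about the recording (see 2.1.) -->
    </Header>
    <Text>
      <!-- required: the conversation itself (see 2.2.) -->
    </Text>
  </Transcript>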

2.1. Header

All sub-elements of <Header>, except for <Participants>, are simple leaf elements, commonly known as nodes: they have no subordinate elements and their content is character data. The elements <Class> and <Acoustic_quality> differ from the rest of these nodes in that they carry attributes. In Fig. 1, required attributes are displayed in regular typeface and optional attributes in italics. The attribute Type of <Acoustic_quality> has three possible predefined values (A, B or C), while the attribute Type1 of <Class> takes character data as its value. The element <Participants> may contain one or more <Speaker> elements, which in turn have a series of sub-elements, some with their own attributes. <Header> thus contains the metadata corresponding to the sound file.
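As a sketch, a minimal <Header> could look as follows; the value of Type1 and the <Speaker> contents are invented for the example, while the element names and the value A are those described above:

  <Header>
    <Class Type1="spontaneous conversation"/>
    <Acoustic_quality Type="A"/>
    <Participants>
      <Speaker>
        <!-- sub-elements with this speaker's metadata -->
      </Speaker>
    </Participants>
  </Header>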


Figure 1: Hierarchical structure of prosodic information in transcripts.
2.2. Text

Whereas <Header> stores only metadata, <Text> organizes all the phenomena, events, actions and spoken texts that constitute a conversation. These elements are explained level by level in the following subsections.

2.2.1. Turns

One or more <Turn> elements are the only direct descendants of <Text>. Each <Turn> may carry two attributes: Name and Trans. The attribute Name is required and admits any type of character as its value, which allows the introduction of the identity codes assigned to each participant in the conversation. Trans is optional: transitions in which a turn is the continuation of a former turn (cont), or in which turns are produced simultaneously (overlap), are considered exceptions and are marked by means of this attribute.
The only required descendants of <Turn> are one or more utterances, <U>. All other sub-elements of <Turn> are optional. <Overlap> indicates the beginning and ending points of overlapped (pas) or overlapping (act) productions; cont is used where any of the simultaneous texts extends over more than one turn, utterance or intonation unit, since <Overlap> may appear at different levels of the structure. A reference number or name may be assigned by means of Ref. The element <Unintelligible> marks unintelligible passages in the transcript. Likewise, the attribute Type of <Pause> may be specified through the values s, l and xl, while Sec allows the specification of its duration in seconds. Finally, <Comment> has optional attributes which describe the different types of phenomena intervening in a conversation.
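A turn marked up with these elements might look as follows; the value of Name and the duration are invented, while Type="l" is one of the predefined pause values:

  <Turn Name="MC1">
    <U><!-- one or more utterances (see 2.2.2.) --></U>
    <Pause Type="l" Sec="0.8"/>
    <Unintelligible/>
  </Turn>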


Figure 2: Turn.
2.2.2. Utterances

Each utterance, <U>, requires one or more <IU> elements, i.e. intonation units, and its syntactic category is specified by Type and Subtype. The sub-elements <Overlap> and <Restart> may optionally appear at this level. The values of the attribute Type of <Restart> stand for repetition (rep), partial restart (part) and total restart (total).
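For instance (the value of the utterance's Type is invented, while rep is one of the predefined values of <Restart>):

  <U Type="declarative">
    <Restart Type="rep"/>
    <IU><!-- one or more intonation units (see 2.2.3.) --></IU>
  </U>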


Figure 3: Utterance.
2.2.3. Intonation units

<IU> may be specified by several attributes containing the tonal information from different positions inside the intonation unit: start tone (ST), end tone (ET), nuclear tone (T), and up to five pre-nuclear tones (T1 to T5). The values of these attributes correspond to the Sp-ToBI tone inventory [1], [11], with slight modifications avoiding the characters * and + so as not to interfere with XML syntax; e.g. L.H corresponds to L*+H, Lh. to L+!H*, and lH. to ¡L+H*.
There are also other attributes, such as Focus (with positive –y– or negative –n– values) or the intermediate tone, IT, which may be high or low. The attributes img and id are used for the HTML rendition (see 4.).
In fact, all sub-elements of <IU> are optional, so that the structure also allows less detailed XML documents, i.e. without rhythm or syllable structures. New elements at this level are <Interruption>, <Fragment> and <Border>. The latter marks the boundary between units, characterizes it by means of the attribute Type, and even allows the input of its duration with Sec.
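A sparsely specified intonation unit could thus be marked up as follows; T="L.H" and Focus="y" use the values defined above, while the id value and the border duration are invented for the example:

  <IU T="L.H" Focus="y" id="df1-04-1">
    <!-- feet and syllables (see 2.2.4., 2.2.5.) -->
    <Border Sec="0.2"/>
  </IU>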


Figure 4: Intonation Unit and Foot.
2.2.4. Feet

In Fig. 4, two broken lines run from <IU> to <S> (syllable), only one of which passes through <F> (foot). Omitting <F> is reasonable when the rhythmic structure is not being analyzed.
When the structure of metrical feet is included, the attribute Wt (weight) may be used, with the values s (strong), w (weak) or 0 (free feet). Upper case is used for strong feet and lower case for weak feet.
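A strong foot containing its syllables would then be encoded as (Wt="s" being one of the values just listed):

  <F Wt="s">
    <!-- the syllables of this foot (see 2.2.5.) -->
  </F>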

2.2.5. Syllables

Between every two syllables, <S>, there is a <Break>, whose values also correspond to those of the Sp-ToBI inventory.
Each <S> has a very important series of attributes: phonetic value (Phon); beginning, end, duration, relative variation, and tempo of the syllable (Beg, End, Dur, Durvar, Tmp); its fundamental frequency, with minimum, maximum, relative variation, and standardized value [5] (F0, F0min, F0max, F0var, F0std), as well as its intensity, with minimum, maximum, relative variation, standardized value, and relative volume (dB, dBmin, dBmax, dBvar, dBstd, Vol).
The sub-elements of <S> are: <Elongation>, which marks the positions where a segmental elongation occurs; <Break> with the value 0, which signals the morphological boundary between two syllables in a liaison; <Comment> with the attribute rec, as a way of orthographically reconstructing unpronounced segments; and <Fragment>, which points out a fragmented production.
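Put together, a fully specified syllable might look like the following fragment; the attribute names follow the inventory above (only a subset is shown), but all values, the orthographic content, and the encoding of the break index as an attribute Type are illustrative assumptions:

  <S Phon="'mon" Beg="1.325" End="1.512" Dur="0.187"
     F0="187" F0min="170" F0max="203" F0var="9.1"
     dB="72" dBmin="65" dBmax="74">mon<Elongation/></S>
  <Break Type="2"/>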


Figure 5: Syllable.
2.3. Document Type Definition (DTD)

This model is then translated into a document type definition (DTD), which acts as the validating grammar of all documents that declare conformance to it.
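Based only on the constraints stated so far (optional <Header>, required <Text>, one or more <Turn>, required Name and optional Trans), a fragment of such a DTD would contain declarations like these:

  <!ELEMENT Transcript (Header?, Text)>
  <!ELEMENT Text (Turn+)>
  <!ATTLIST Turn
            Name  CDATA #REQUIRED
            Trans (cont | overlap) #IMPLIED>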


Figure 6: Document Type Definition.
3. Collecting PRAAT data into XML

3.1. Recording of conversations and basic transcription

The digital recordings adapted to this transcription system belong to a corpus of spontaneous conversations between speakers from Madrid and Mexico City, representing the standard Iberian and Mexican vernaculars respectively.

Basic transcripts of these conversations, produced according to strict transcription criteria, make the recordings manageable. They also provide the first input for the PRAAT text grids, in which the conversations are then phonetically transcribed. Further tiers may represent, e.g., the rhythmical structure of syllables and their standardized F0 values [5].


Figure 7: PRAAT text grid.
3.2. PRAAT scripting

PRAAT is used in this process not just for its phonetic analysis features, but also for the ease with which scripts can be created to automate its own functions.
Such scripts create images for each intonation unit or utterance. These images, showing pitch, spectrogram and the content of all tiers, are used for the final HTML rendition. Scripts also allow the phonetic parameters of each syllable to be assigned to variables, and this process to be looped over every syllable of an intonation unit. The values of the variables may be continuously appended to a text file following an XML-like syntax.
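As a sketch of this mechanism, after looping over two syllables the appended text might read as follows (attribute names as in 2.2.5.; all values invented):

  <IU id="df1-04-1">
  <S Phon="'dos" Beg="2.431" End="2.618" Dur="0.187" F0="121" dB="69"/>
  <S Phon="kar" Beg="2.618" End="2.801" Dur="0.183" F0="115" dB="67"/>
  </IU>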


Figure 8: PRAAT script.
3.3. XML document

The text file yielded by such a PRAAT script will look like this:


Figure 9: Resulting XML document.
Since XML documents are plain text files, no file conversion or adaptation is needed. Possible repetitions, mistakes or PRAAT character codes (incompatible with Unicode) may be replaced by means of a Visual Basic macro.


Figure 10: Visual Basic macro.
4. Rendering XML in HTML format

The most demanding part of the whole process of creating XML documents is the application of style data, since this requires knowledge of several computer languages, tools and applications: XSLT, XPath, HTML, CSS, JavaScript, etc.


Figure 11: Style sheet in XSLT.
XML documents are transformed by a set of mechanisms called XSLT (Extensible Stylesheet Language Transformations [6]). These style sheets do not just format the text to be rendered in HTML, but also process the structure and information of elements and attributes in order to introduce, for example (see the sketch following this list):
1) a table with the contents of the header, if it exists;
2) a separate table with the speakers’ information;
3) a summary, where all prosodic phenomena encoded in the document are counted up, and
4) a list of conventions used throughout the text.
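A minimal sketch of such a style sheet is given below; the element names follow section 2, while the HTML layout and the choice of counted phenomena are invented for the example:

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/Transcript">
      <html>
        <body>
          <!-- 1) a table with the contents of the header, if it exists -->
          <xsl:if test="Header">
            <table>
              <xsl:for-each select="Header/*[not(self::Participants)]">
                <tr>
                  <td><xsl:value-of select="name()"/></td>
                  <td><xsl:value-of select="."/></td>
                </tr>
              </xsl:for-each>
            </table>
          </xsl:if>
          <!-- 3) part of a summary: counting encoded phenomena -->
          <p>
            Pauses: <xsl:value-of select="count(//Pause)"/>,
            overlaps: <xsl:value-of select="count(//Overlap)"/>
          </p>
          <!-- the spoken text itself, turn by turn -->
          <xsl:apply-templates select="Text/Turn"/>
        </body>
      </html>
    </xsl:template>
    <xsl:template match="Turn">
      <p><b><xsl:value-of select="@Name"/>: </b><xsl:apply-templates/></p>
    </xsl:template>
  </xsl:stylesheet>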
Moreover, by means of the style sheet, certain features may be added in order to insert a hyperlink at the beginning of each intonation unit pointing to an analysis window, where an image showing, e.g., pitch, spectrogram and text tiers, and the corresponding sound file may be seen and listened to.
It is at this point that the img and id attributes of the element <IU> are called upon: img defines which button is displayed at the beginning of the intonation unit, and id provides the identity codes linking the linguistic production with the image and sound files. If the button indicates that an analysis window is linked to a particular intonation unit, the browser runs a JavaScript program which opens another browser window displaying a web page containing the image and a link to the sound file.
Last but not least, it is necessary to point out that the ultimate aim of XML is not merely to render HTML pages, but to enrich them with a hierarchical structure and self-descriptive information. This is particularly useful in the case of <S>, whose attributes contain important phonetic data ready to use. Here lies the most important difference between XML and HTML: with XML all this information remains stored and may be recalled at any time.
Recalling the information may be done dynamically, for example by making each syllable in the text react when the cursor passes over it: the syllable turns red and displays a yellow label with its most important phonetic information. If the syllable is clicked, a dialog window opens displaying all attribute values of <S>. The final result in a web browser window appears as shown in Fig. 12. To test this demonstration in real time, refer to:

<http://www.geocities.com/utka/df1-04-1.html>

Figure 12: HTML rendition of an XML document.
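One way the style sheet could emit a syllable to obtain this behaviour is sketched below; the identity code, the label contents and the JavaScript function showSyllable are all hypothetical:

  <span id="df1-04-1-s12"
        title="Phon: 'dos | F0: 121 Hz | Dur: 187 ms"
        onmouseover="this.style.color='red'"
        onmouseout="this.style.color=''"
        onclick="showSyllable('df1-04-1-s12')">dos</span>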

5. Conclusions

In this paper, I have presented an XML-based transcription system designed to be applied as a tool to study several prosodic phenomena in spoken conversations pertaining to the Spanish vernaculars of Madrid and Mexico City.
This process begins with the basic transcription of the recorded interactions, in which the relevant prosodic phenomena are minimally marked. This information constitutes the primary input for segmenting the sound file in PRAAT. Other analyses are added to the text grid as further tiers. The phonetic and textual information corresponding to each analyzed segment (turn, utterance, intonation unit or syllable) is then extracted and written into a text file with an XML-like syntax. Once the resulting XML document is free of errors, well-formed and valid according to its document type definition, it is processed and formatted by means of a style sheet, which yields an enriched and dynamic HTML document.
The resulting document is then something very different from the commonly-held conception of a text or web page. Beyond its appearance, it treats data in a way that facilitates the analysis of spoken corpora through the coordination and combination of text, databases, sound files and images, offering a very powerful but easy-to-use platform.

6. References

[1] Beckman, M.E.; Díaz-Campos, M.; Tevis McGory, J.; Morgan, T.A., 2002. Intonation across Spanish in the Tones and Break Indices framework. Probus 14: 9-36.
[2] Berners-Lee, T.; Hendler, J.; Lassila, O., 2001. The Semantic Web. Scientific American: 17/05/2001 [http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21]
[3] Bray, T.; Paoli, J.; Sperberg-McQueen, C.M.; Maler, E.; Yergeau, F. (ed.), 2004. Extensible Markup Language (XML) 1.0, Second Edition. W3C Recommendation: 04/02/2000. [http://www.w3.org/TR/REC-xml]
[4] Burnett, D.C.; Walker, M.R.; Hunt, A. (ed.), 2004. Speech Synthesis Markup Language (SSML) Version 1.0. W3C Rec.: 07/09/2004. [http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/]
[5] Cantero, F.J., 2002. Teoría y análisis de la entonación. Barcelona: Universitat de Barcelona.
[6] Clark, J. (ed.), 1999. XSL Transformations (XSLT) Version 1.0. W3C Rec.: 16/11/1999. [http://www.w3.org/TR/1999/REC-xslt-19991116]
[7] Goldfarb, C.F.; Prescod, P., 2000. The XML Handbook, 2nd Ed. London et al.: Prentice Hall.
[8] McGlashan, S.; Burnett, D.C.; Carter, J.; Danielsen, P.; Ferrans, J.; Hunt, A.; Lucas, B.; Porter, B.; Rehor, K.; Tryphonas, S., 2004. Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Rec.: 16/03/2004. [http://www.w3.org/TR/2004/REC-voicexml20-20040316/]
[9] Schmidt, T., 2002. EXMARaLDA - ein System zur Diskurstranskription auf dem Computer. Arbeiten zur Mehrsprachigkeit, B (34), Hamburg. [http://www.rrz.uni-hamburg.de/exmaralda/Daten/4D-Literatur/AZM.pdf]
[10] Schmidt, T., 2005. Time-based data models and the Text Encoding Initiative's guidelines for transcription of speech. Arbeiten zur Mehrsprachigkeit, B (62), Hamburg. [http://www.rrz.uni-hamburg.de/exmaralda/Daten/4D-Literatur/SFB_AzM62.pdf]
[11] Sosa, J.M., 2003. La notación tonal del español en el modelo Sp-ToBI. In Teorías de la entonación, P. Prieto (ed.). Barcelona: Ariel, 185-208.
[12] Sperberg-McQueen, C.M.; Burnard, L. (ed.), 2004. TEI P5. Guidelines for Electronic Text Encoding and Interchange. The TEI Consortium. [http://www.tei-c.org/P5/Guidelines/]