Four-Features Evaluation of Text to Speech Systems for Three Social Robots

The success of social robots is directly linked to their ability to interact with people. Humans possess verbal and non-verbal communication skills, and, therefore, both are essential for social robots to achieve natural human-robot interaction. This work focuses on the first of them, since the majority of social robots implement an interaction system endowed with verbal capacities. To implement such capacities, we must equip social robots with an artificial voice system. In robotics, a Text to Speech (TTS) system is the most common speech synthesis technique. The performance of a speech synthesizer is mainly evaluated by its similarity to the human voice in relation to its intelligibility and expressiveness. In this paper, we present a comparative study of eight off-the-shelf TTS systems used in social robots. In order to carry out the study, 125 participants evaluated the performance of the following TTS systems: Google, Microsoft, Ivona, Loquendo, Espeak, Pico, AT&T, and Nuance. The evaluation was performed after observing videos where a social robot communicates verbally using one TTS system. The participants completed a questionnaire to rate each TTS system in relation to four features: intelligibility, expressiveness, artificiality, and suitability. In this study, four research questions were posed to determine whether it is possible to present a ranking of TTS systems in relation to each evaluated feature, or whether, on the contrary, there are no significant differences between them. Our study shows that participants found differences between the TTS systems evaluated in terms of intelligibility, expressiveness, and artificiality. The experiments also indicated that there was a relationship between the physical appearance of the robots (embodiment) and the suitability of TTS systems.


Introduction
Social robots are intended to "live" around humans to help and/or entertain them. In this regard, speech is probably the richest and the preferred way for humans to communicate, which makes the software that allows the robot to generate an artificial voice a crucial element of human-robot interaction. These systems, commonly known as Text To Speech (TTS) systems, convert text to an artificial voice.
There are several definitions of what a TTS system is. Van Bezooijen defines it as a system that 'allows the generation of novel (oral) messages, either from scratch (i.e., entirely by rule) or by recombining shorter pre-stored units' [1]. On the other hand, Handley uses the following definition: 'Speech synthesis systems, or speech synthesizers, are computer programs which automatically generate speech, i.e., systems which enable the computer to "talk" or "speak" to the user' [2]. Since the

Related Work
In this section, the most important TTS systems that are currently available are presented. Moreover, we also highlight those systems that are used in social robots and other electronic devices. After that, we review the previous literature of comparative studies of different TTS systems and the features that are evaluated.

Relevant TTS Systems in the Market
Several TTS systems are currently available; this section describes some of the most relevant ones, which are also listed in several references [5,6]:
• Mbrola is an open source artificial voice generation system which allows, at a low level, a great degree of control over the synthesized speech. In this sense, the user can configure various parameters to get precise prosodic control [7].
• Loquendo TTS synthesizes a human-like voice in multiple languages. It became very popular on Internet platforms, like YouTube, since the community employed it to generate tutorials and parodies.
• Pico is a TTS system developed by SVOX that is currently installed by default on most Android devices (at least until the 4.2 version). Note that SVOX and Loquendo were acquired by Nuance in 2011.
• Nuance Real Speak is the flagship speech synthesis product of the Nuance Communications company. It generates voices in several languages and it is the official voice of Apple's virtual assistant, Siri.
• Festival is a general multilingual speech synthesis system originally developed at the Centre for Speech Technology Research (University of Edinburgh). It is distributed under a free software license similar to BSD.
• Ivona: This is the TTS system developed by the Amazon company. It is widely used in Amazon devices, such as the Kindle electronic reader.
• Google: This is the voice system developed by Google and is used in its applications, web services, and in its virtual assistant 'Google Now'. The generated voice is different in each language (it supports over 80 languages and dialects).
• Microsoft: This is the voice system of the Microsoft company, and it is used in its services, applications, operating systems, and in its virtual assistant 'Cortana'.
• AT&T: This is the system developed by the AT&T company. It generates speech in eight different languages, and it is used in its call centres.
• Verbio: This is the system developed by the Spanish company Verbio. It is mainly used by call-centre services of companies and public institutions.

The TTS Systems Used in Social Robots and Other Electronic Devices
In social robotics, not all the robots communicate using synthesized speech. In fact, some of them do not use any sound to express themselves, such as the robot Keepon [8]. On the other hand, there is another group of social robots that express their internal state, but just using non-verbal sounds. This is the case of Paro [9], a baby seal robot that plays sounds similar to the ones that a real baby seal emits. Another example of a social robot with non-verbal communication skills is the robotic dog Aibo [10], developed by Sony. Using some pre-generated sounds, Aibo can share with the user some internal states such as 'happy', 'sad', etc. In addition, it can also communicate different events such as a user detection, an internal failure, etc.
Finally, there is another group of social robots that has verbal communication skills. For example, the Nao robot [11], developed by Aldebaran, is a humanoid robot that is able to interact with people in a natural way by voice and gestures. This little humanoid (it is about 58 cm tall) has become one of the research platforms most used by the robotics community for HRI (Human-Robot Interaction). This robot uses the TTS web service of Nuance. Aldebaran has also recently developed a new social robot called Pepper (more details in [12]). This is another humanoid robot, but bigger than Nao (about 120 cm tall). Pepper, like any other social robot, has been created to interact with humans using natural modes of interaction such as voice, gestures, nonverbal sounds, touch, etc. Moreover, Pepper includes an additional input and output channel: a tablet placed on its chest. Like Nao, Pepper also uses Nuance as its TTS system.
Another social robot with verbal communication skills is iCub [13,14]. Its appearance is similar to that of a two-year-old child (about 1 m tall), and it is used as a research platform to test learning algorithms, cognitive skills, and artificial intelligence algorithms. In this case, the TTS system used is Acapela [15]. The robot Jibo is another kind of social robot that has been recently developed. The company's founder, Cynthia Breazeal, describes Jibo 'as the result of R2D2 and Siri having a baby,' that is, it is a robot that is endowed with verbal and non-verbal communication skills. Jibo supports text-to-speech markup; this allows selecting which parts of the synthesized text should be given emphasis, or how unusual words or names should be pronounced. This feature is particularly important given the emphasis on the robot having its own specific personality. The Yamaha Vocaloid Humanoid Robot uses Vocaloid [16][17][18]. This system is different because it is used for both talking and singing, which allows it to create very realistic artificial singing voices.
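To illustrate the kind of text-to-speech markup mentioned above, the following minimal sketch builds a small fragment in SSML, the W3C Speech Synthesis Markup Language. Jibo's own markup format is not described here, so SSML is used only as a stand-in assumption; the function name `build_ssml` and the example sentence are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(emphasized, written, spoken):
    """Build an SSML fragment that stresses one word and respells another.

    A complete SSML document would also declare the SSML namespace and
    the language of the text; both are omitted here for brevity.
    """
    speak = ET.Element("speak", {"version": "1.1"})
    speak.text = "I am "
    # <emphasis> asks the synthesizer to stress the enclosed text.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    emphasis.tail = " interested in "
    # <sub> replaces the written form with an alias that is spoken instead.
    sub = ET.SubElement(speak, "sub", {"alias": spoken})
    sub.text = written
    sub.tail = "."
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("really", "HRI", "human robot interaction")
print(ssml)
```

Whether a given TTS system honours these tags depends on the engine, so the fragment should be read as an example of the idea rather than as input accepted by any of the systems discussed here.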
In relation to electronic devices, nowadays, there is a new generation of 'intelligent' devices (smartphones, tablets, smart speakers, smartwatches...) that are equipped with a virtual assistant. The user interacts with this assistant by voice. In the case of the Apple Inc. devices (iPad, iPhone, Watch, etc.), the assistant is Siri [19], which uses Nuance for voice recognition and speech synthesis tasks. Android-based devices have another assistant, called Google Assistant [20], which uses the Google TTS technology. Moreover, the Amazon devices, such as the Kindle [21] or Echo (https://www.amazon.es/Amazon-Echo-Altavoz-Inteligente-Alexa), use Ivona "under the hood". Finally, there are other electronic devices, developed by Microsoft, which are equipped with the voice assistant Cortana [22] that uses Microsoft TTS.

Previous Comparative Studies of TTS Systems
The literature offers few comparative studies of the performance of TTS systems. In addition, the related references are not very recent, so some of the evaluated TTS systems may now be discontinued. These studies mainly aimed to improve the Interactive Voice Response applications (IVRs) that are used in call centres. Nevertheless, in this section, we briefly present these papers.
In 2006, Roehling presented a comparative table of 12 TTS systems: BabTTS, Natural Voices, DECtalk, Naxpres, Loquendo, Mulan, Speech SDK, RealSpeak, Festival, Gnuspeech, OpenMary, and ProSynth [23]. This analysis studied the different features of the TTS systems needed for synthesizing expressive speech, considering pitch, duration, loudness, and voice quality. The authors concluded that OpenMary was the best solution to endow the robot B21r with affective speech.
More recently, in 2009, Handley [2] presented a study focusing on the requirements of a TTS system to be used in Computer-Assisted Language Learning applications. These systems have control over the characteristics of the generated speech, that is, different styles (formal or familiar), different communicative rhythms (the speech rate), different tones of voice (timbre), and different ways of expression (interrogative, enunciative, imperative, exhortative, and exclamative). This study analyzed four TTS systems: AT&T, Nuance Vocalizer (the predecessor software of Nuance NaturalSpeak), eLite, and Acapela BrightSpeech [15]. The participants listened to the different systems using a PC and, after that, they completed an online questionnaire to assess their adequacy, acceptability, and intelligibility. Then, the average scores obtained for each TTS system and for each feature were presented. On average, the top-rated TTS system was Acapela BrightSpeech.
More recent papers focus on comparative studies of TTS systems for non-Latin languages, such as Arabic and Hindi. Research in this area has so far been mainly confined to English and other European languages (Spanish, German, French, Italian, etc.). For the Arabic [24] and Indian languages [25], such tools are still in their infancy, and the TTS systems developed are mainly used to help visually impaired people.

Evaluated Features of the TTS Systems
In order to determine the performance of a TTS system, it is necessary to define its characteristics and the way to evaluate them. In this sense, some authors have stated a formal definition of the features of a TTS system. According to Francis [26], the most important features of a TTS system are intelligibility and naturalness. First, he defines intelligibility as the ease of users' understanding the speech generated by humans or machines. Then, he defines a natural conversation (naturalness) as a speech that sounds as if it had been produced by a native speaker.
Handley [27] suggests that the quality of speech generated by a TTS system during Human-Computer Interaction (HCI) should be as comprehensible, natural, and accurate as possible. Later, as previously presented, Handley, in [2], presents a comparative study of different TTS systems and evaluates them in relation to the following features:

1. Adequacy: 'is the speech adequate for use as a reading machine (in comparison with other media)?'
2. Acceptability: 'is the speech acceptable for use as a reading machine (when is not possible to use other media)?'
3. Comprehensibility: 'is the message easy to understand?'
4. Intelligibility: 'are the individual phonemes/sounds and words easy to recognize (and discriminate one from another)?'
6. Precision of phonemes: 'was the articulation of the phonemes/sounds precise?'
7. Appropriateness of prosody: 'was the prosody (music) of the utterance appropriate?'
8. Appropriateness of register: 'was the register appropriate?'
On the other hand, the International Telecommunication Union (ITU-T), in 1994, set a questionnaire to evaluate TTS systems in voice applications (call centres) [28]. This questionnaire used the Mean Opinion Score (MOS) [29], and the evaluated features were the following:
1. Sound quality acceptance: related to the quality of the sound. This requires a yes or no answer.
2. Listening effort: related to the effort required to understand the message.
3. Comprehension problems: related to the difficulties to understand certain words.
4. Articulation: related to the question about if the sounds were distinguishable.
5. Pronunciation: related to the possible anomalies detected in pronunciation.
6. Speaking rate: related to the average speed of delivery.
7. Pleasantness: related to the pleasantness of the voice.
8. Overall impression.
The user evaluates each feature using a score from 1 to 5 (five-point Likert Scale), 5 being the most positive (except for sound quality acceptance, which required a yes/no answer). Other studies have been carried out using this MOS scale, or a modified version. This is the case of the research presented by Viswanathan in [30]. He uses an extended version of the MOS scale, and he concludes that the most important features to evaluate a TTS system are intelligibility and naturalness. Each of these concepts, according to that author, includes other features of a TTS system. That is, naturalness includes: naturalness, ease of listening, pleasantness, and audio flow; on the other hand, intelligibility includes: listening effort, pronunciation, comprehension, articulation, and speaking rate. More recently, in 2014, King [31] performs a review of the improvements obtained in the TTS technologies during the last decade. Again, he claims that the evaluations of naturalness and intelligibility are the main evaluation criteria for determining the quality of the speech synthesis. For social robots, Alonso [32] defines the naturalness of the generated speech as its degree of similarity with that emitted by a human, while the intelligibility is defined as the ease of the user's understanding the message generated by the robot. For that author, these two features are the most important ones during HRI.
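As a minimal sketch of how a Mean Opinion Score is obtained from such a questionnaire, the snippet below averages the 1 to 5 ratings collected for each feature. The ratings are made up for illustration and the function name `mean_opinion_score` is hypothetical; it is not taken from the ITU-T questionnaire or from any of the cited studies.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of ratings given on a five-point scale (1 to 5)."""
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return mean(ratings)


# Hypothetical ratings from four listeners for three of the features.
ratings_by_feature = {
    "listening effort": [4, 5, 3, 4],
    "pronunciation": [3, 3, 4, 2],
    "pleasantness": [5, 4, 4, 5],
}

mos = {feature: mean_opinion_score(r) for feature, r in ratings_by_feature.items()}
for feature, score in mos.items():
    print(f"{feature}: MOS = {score}")
```

The yes/no item (sound quality acceptance) is left out of the sketch because it is not rated on the five-point scale.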

Experiment
As already stated, in this paper, we present a comparative study of the performance of several TTS systems to be used in social robots. In order to carry out this study, some of the TTS systems that are currently available, described in Section 2.1, were integrated in our social robots, introduced in this section, particularly in Section 3.2. By means of a questionnaire, the participants evaluated them by rating their features.

The Compared Text-To-Speech Systems
The social robots used to carry out this comparative study have an interaction system known as the "Robotic Dialog System", or just RDS, presented in [33]. The RDS gives these robots the capacity to interact with humans, especially using multimodal speech dialogs. In this study, we have implemented and used the component called 'Text-To-Speech'. This component allows our social robots to communicate with the users using different kinds of voice, language, volume, etc. In addition, it integrates the eight TTS systems that are analyzed in this paper: Loquendo (v7.7) 7.
The first five TTS systems require an Internet connection (since they use web services), while the last three do not require a persistent connection.
We selected these eight TTS systems based on three main requirements: (i) the system should be used in different domains, paying special attention to developments integrated by the robotics research community; (ii) the software should be open source or, at least, offer a trial version; and (iii) it should support the Spanish language with acceptable technical support. Thus, Festival was not selected since it does not offer robust Spanish support, and Verbio was discarded since it does not offer a trial version. It should be noted that the selected TTS systems (except for Loquendo) cannot be customized, that is, they offer just one version. In the case of Loquendo, we use its default speech in order to make all results in this study comparable.
The robots integrate a dialog mechanism to enable natural HRI. For this reason, selecting the most adequate TTS system is crucial to enhance the user experience. Apart from the dialog system, the robots include high-quality speakers, microphones, and sound cards. The first robot, Maggie, is able to move through the environment to interact with people. The robot was originally designed as a generic research platform to test interaction mechanisms to improve the HRI experience. Maggie can communicate through sounds, gestures, and a touch-screen mounted on its chest. The robot has a rigid plastic shell and is 1.40 m tall. Mini is a desktop version of Maggie, also developed by the RoboticsLab, that acts as a companion for elderly people. In contrast to Maggie, Mini is shorter (just 55 cm), is covered in a plush-like soft fabric, and integrates the same HRI capabilities as Maggie, with an external tablet to enhance interaction. Finally, Mbot is another mobile platform, developed in the EU project MOnarCH [38]. The robot is 1.15 m tall, that is, about the height of an 8-11 year-old child, since this social platform was designed to interact with children at the pediatric ward of the Portuguese Oncology Institute in Lisbon (Portugal). Similarly to Maggie, Mbot's shell is made of a rigid material, carbon fibre.

Procedure
As seen in Section 2, several authors have proposed different sets of characteristics to evaluate the performance of a TTS system. In the present paper, considering these references, especially the ones presented by Handley [2] and Viswanathan [30], the performance of each TTS system is determined by evaluating, using questionnaires, the following features: intelligibility, expressiveness, artificiality, and suitability. The question associated with each feature was rated on a 5-point Likert scale. In the case of expressiveness, the rating varies between 'very monotonous' (1) and 'very expressive' (5). For the other features, the lowest score corresponds to 'Not at all' while the highest one corresponds to 'Yes, absolutely.' As can be observed, in addition to intelligibility (known as comprehensibility by Handley) and naturalness (also known as expressiveness by Handley), and considering our target scenario, TTS systems in human-robot interaction and, more specifically, in social robotics, we have included two other important features: artificiality, related to the metallic/robotic sound of the voice, and suitability, related to the user's perception of whether the voice suits the robot considering its external appearance. The evaluation of these characteristics, as also stated in [32], is important in this kind of comparative study.
The questionnaires were created using the web tool 'Google Forms' [39]. The first page of the questionnaire is an introductory page where the user has to read some instructions about how to fill it in, and to answer some personal questions: age, gender, and educational level (university or non-university studies). The main part of the questionnaire is divided into eight pages, each one associated with a TTS system. The order of the pages was randomized when the forms were created. Every page shows a short video where the robot is talking using a specific TTS system. The robot says the following sentence in Spanish: 'This is the robot X and this is a test sentence to evaluate the TTS system Y.' After hearing this sentence, the user must score the four questions, and then the next page appears, showing the same robot using a different TTS system. These questionnaires were distributed publicly for a month through the Internet using social networks in order to reach as many people as possible. Each user was only allowed to fill out one questionnaire, so each user evaluated the performance of the eight TTS systems for just one robot. The researchers assigned the questionnaire to each user, trying to balance the number of participants per robot/questionnaire type, so the user did not know about the existence of the other robots.
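The two procedural choices described above, one randomized page order per form and a balanced assignment of participants to robots, can be sketched as follows. This is only an illustration under stated assumptions: the actual forms were built in Google Forms, and the helper names `build_form` and `next_robot`, the fixed seed, and the "fewest responses first" rule are hypothetical.

```python
import random

TTS_SYSTEMS = ["Google", "Microsoft", "Ivona", "Loquendo",
               "Espeak", "Pico", "AT&T", "Nuance"]
ROBOTS = ["Maggie", "Mini", "Mbot"]


def build_form(robot, rng):
    """Create one questionnaire: eight pages, one per TTS system, shuffled once."""
    pages = list(TTS_SYSTEMS)
    rng.shuffle(pages)
    return {"robot": robot, "pages": pages}


def next_robot(counts):
    """Pick the robot with the fewest participants so far (assumed balancing rule)."""
    return min(ROBOTS, key=lambda robot: counts[robot])


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
forms = {robot: build_form(robot, rng) for robot in ROBOTS}

# Assign nine hypothetical participants, one questionnaire each.
counts = {robot: 0 for robot in ROBOTS}
for _ in range(9):
    counts[next_robot(counts)] += 1

print(forms["Maggie"]["pages"])
print(counts)
```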

Research Questions
These questionnaires had two goals. The first one was to answer the following research questions:
1. RQ1: are all TTS systems equally well understood?
2. RQ2: do all TTS systems have the same expressiveness?
3. RQ3: are all TTS systems equally perceived as robotic?
4. RQ4: are all TTS systems equally suitable for each robot?
If the results showed significant differences between the TTS systems, the second goal was to rank them according to the features evaluated.

Participants
For this study, we obtained 125 questionnaires in all (for the three robots). The distribution among the robots is the following: 44 questionnaires for Maggie (35.2%), 42 for Mini (33.6%), and 39 for Mbot (31.2%).
Regarding their age, participants were grouped into three categories: 17-30 years, with 33 participants (26.4%); 31-40 years, with 86 participants (68.8%); and more than 41 years, with six participants (4.8%). Most of the participants were male (94 participants, 75.2%) and just 31 participants were female (24.8%). Finally, regarding the educational level, 24 participants (19.2%) reported having completed only non-university studies (primary or secondary), while the majority of the participants (101, 80.8%) reported having completed university studies (a bachelor's degree, master's degree, or PhD).
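The percentages reported for the sample follow directly from the counts. As a quick arithmetic check, the short sketch below recomputes them from the figures given above; the helper name `share` is hypothetical.

```python
TOTAL = 125  # questionnaires collected


def share(count, total=TOTAL):
    """Percentage of the sample, rounded to one decimal place."""
    return round(100 * count / total, 1)


per_robot = {"Maggie": 44, "Mini": 42, "Mbot": 39}
per_age = {"17-30": 33, "31-40": 86, "more than 41": 6}
per_gender = {"male": 94, "female": 31}
per_education = {"non-university": 24, "university": 101}

for name, groups in [("robot", per_robot), ("age", per_age),
                     ("gender", per_gender), ("education", per_education)]:
    # Every grouping should account for all 125 questionnaires.
    print(name, {g: share(c) for g, c in groups.items()}, sum(groups.values()))
```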

Results
This section introduces a thorough analysis of the questionnaires, grouping the results regarding the research questions presented in Section 3.4. The software used in the statistical analysis of the results was IBM SPSS [40].
In our analysis, the TTS system was the independent variable and the scores given for each research question (feature) were the dependent measures; their mean and standard deviation values were calculated and are presented in the next sections. We also had to test whether the differences between the mean values of the TTS systems were significant for each dependent measure, using a one-way repeated measures ANOVA. When this analysis yielded a statistically significant result, we could determine which TTS systems differ from one another. This information is provided in the pairwise comparison tables presented in Appendix A.
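To make the idea of a one-way repeated measures ANOVA concrete, the following sketch computes the univariate F statistic from a participants-by-systems table of scores. It is a simplified illustration and not the SPSS procedure used in the study, which reports the multivariate Wilks' Lambda test; the data and the function name `rm_anova_f` are hypothetical, and no p-value or sphericity correction is computed.

```python
def rm_anova_f(scores):
    """Univariate one-way repeated-measures ANOVA.

    scores[i][j] is the rating given by participant i to condition j.
    Returns (F, df_conditions, df_error).
    """
    n = len(scores)       # participants
    k = len(scores[0])    # conditions (e.g., TTS systems)
    grand = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    # Partition the total variability into conditions, subjects, and error.
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (n - 1) * (k - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return f_value, df_cond, df_error


# Hypothetical scores: 4 participants rating 3 systems on a 1-5 scale.
example = [
    [5, 3, 1],
    [4, 3, 2],
    [5, 4, 1],
    [4, 2, 2],
]
f_value, df_cond, df_error = rm_anova_f(example)
print(f"F({df_cond}, {df_error}) = {f_value:.2f}")
```

Because every participant rates every system, the between-participant variability is removed from the error term, which is what distinguishes this design from an ordinary one-way ANOVA.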

Intelligibility: Are All TTS Systems Equally Well Understood?
This first feature evaluates whether the voice is clearly understood. Considering the results of the multivariate test, Wilks' Lambda (WL), there are significant differences between the TTS systems, WL = 0.101, F(7, 118) = 149.89, p < 0.001. In Figure A1 (see Appendix A), the pairwise comparison table is presented. Therefore, we can say that the answer to RQ1 is that not all the TTS systems are equally well understood. This answer allows ranking the TTS systems by ordering them from the highest mean value (at the left of the figure) to the lowest mean value (at the right of the figure); see Figure 4. The ranking shows that, in terms of intelligibility, the best-synthesized voice corresponds to Google. Ivona TTS also receives a good score. In fact, there is no significant difference with Google: p = 0.228. We can identify a second group significantly different from the previous ones. This is composed of Loquendo, Nuance, Microsoft, and Pico. The study shows that the intelligibility of AT&T and Espeak is noticeably worse.
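The reported statistics can be related to each other with a short worked example. Assuming the standard multivariate test for a single within-subject factor, the F statistic is an exact function of Wilks' Lambda, with degrees of freedom k − 1 and n − k + 1 for k conditions and n participants. The sketch below applies this relation to the intelligibility result; the function name `f_from_wilks` is hypothetical.

```python
def f_from_wilks(wilks_lambda, n_participants, n_conditions):
    """Convert Wilks' Lambda to F for a single within-subject factor.

    Returns (F, df1, df2) with df1 = k - 1 and df2 = n - k + 1.
    """
    df1 = n_conditions - 1
    df2 = n_participants - n_conditions + 1
    f_value = (1 - wilks_lambda) / wilks_lambda * df2 / df1
    return f_value, df1, df2


# Intelligibility: 125 participants, 8 TTS systems, WL = 0.101 (rounded).
f_value, df1, df2 = f_from_wilks(0.101, 125, 8)
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

With n = 125 and k = 8, the degrees of freedom are (7, 118), as reported. The computed F is about 150.05; the small gap with respect to the reported 149.89 is consistent with Lambda having been rounded to three decimals.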

Expressiveness: Do All TTS Systems Have the Same Expressiveness?
This feature expresses how monotonous or expressive users perceive the synthetic voice generated by the TTS system. Again, we analyze the results provided by the ANOVA test. In this case, WL = 0.25, F(7, 118) = 49.56, p < 0.001, which means that the different TTS systems differ in expressiveness. For this reason, we can say that not all TTS systems have the same expressiveness (RQ2). The pairwise comparison table is presented in Figure A2; see Appendix A.
As in the previous feature, we can use the means and standard deviation to rank the systems evaluated regarding their expressiveness.
Again, Google TTS stands out, being perceived as the most expressive system (p < 0.05); see Figure 5. After Google, we find Loquendo, Ivona, Microsoft, and Nuance, with no significant differences among them in terms of expressiveness (p = 1). Pico, AT&T, and Espeak are perceived as the least expressive systems.
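Ranking the systems by their mean score can be sketched in a few lines. The scores below are invented and the system names are placeholders, so the output does not reproduce the study's figures; the function name `rank_by_mean` is hypothetical.

```python
from statistics import mean, stdev


def rank_by_mean(scores):
    """Return (name, mean, standard deviation) tuples, highest mean first."""
    summary = [(name, mean(values), stdev(values))
               for name, values in scores.items()]
    return sorted(summary, key=lambda row: row[1], reverse=True)


# Hypothetical 1-5 ratings from four participants per system.
scores = {
    "System B": [3, 3, 4, 2],
    "System A": [5, 4, 5, 4],
    "System C": [1, 2, 1, 2],
}

ranking = rank_by_mean(scores)
for name, avg, sd in ranking:
    print(f"{name}: mean = {avg:.2f}, sd = {sd:.2f}")
```

A ranking of this kind only orders the means; whether two neighbouring systems actually differ still has to be decided with the pairwise comparisons.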

Artificiality: Are All TTS Systems Equally Perceived as Robotic?
Considering artificiality, the aim is to analyze how "robotic" the participants perceive the robot's voice to be. By robotic, we mean how metallic or non-human-like the voice sounds. The results from the multivariate test, Wilks' Lambda, show significant differences between the TTS systems, WL = 0.37, F(7, 118) = 28.17, p < 0.001.
Given these results, the answer to the RQ3 is that not all the TTS systems are equally perceived as "robotic". Figure A3 (Appendix A) presents the pairwise comparison table for this feature.
The results show that Espeak was perceived as the most artificial TTS system, with a significant difference with respect to the other systems evaluated. Figure 6 shows the ranking regarding Artificiality where, after Espeak, the systems are sorted as follows: AT&T, Loquendo, Pico, Microsoft, Nuance, Ivona, and Google. In contrast to intelligibility and expressiveness features, there is no clear set differentiation among the TTS systems as they all present similarities (p > 0.05) with their neighboring ranked ones. In any case, Google is perceived as the most natural TTS system showing that there is a correlation between the features analyzed in this work.

Suitability: Are All TTS Systems Equally Suitable for Each Robot?
This feature investigates which TTS system is perceived as the most suitable for each of the three social robots presented in Section 3.2. This research question is considered for each robot separately. Therefore, a one-way repeated measures ANOVA is conducted, using the scores obtained for each robot, to determine whether there are significant differences between the TTS systems in terms of their suitability for a specific robot.

Maggie
According to the results obtained for Maggie (Wilks' Lambda = 0.52, F(7, 116) = 15.61, p < 0.001), there are significant differences between the TTS systems. For this reason, we can say that not all the TTS systems are equally suitable for Maggie. Table 1 shows the descriptive statistics and Figure A4 presents the pairwise comparison table. In this figure, it can be observed that the most suitable one is Google, although Ivona, Loquendo, and Nuance obtain similar results, p > 0.112. On the other hand, the worst evaluated TTS systems, being significantly different from Google (p < 0.05), are Espeak and Pico (see Figure 7).

Mbot
In the case of Mbot, the results of the multivariate test, WL = 0.73, F(7, 116) = 6.22, p < 0.001, also confirm that not all the TTS systems are perceived as equally suitable for Mbot. Table 2 presents the values of the mean and the standard deviation, and the pairwise comparison table is shown in Figure A5. For this robot, the favourite one is Ivona, with Microsoft and Google the second and the third best evaluated TTS systems. These three systems obtained similar results, p = 1. At the other extreme, AT&T and Loquendo were the worst evaluated systems and are considered significantly less suitable for this robot than Ivona (p < 0.05); see Figure 7.

Mini
Finally, for Mini, WL = 0.64, F(7, 116) = 9.21, p < 0.001, so, again, there are significant differences between the TTS systems in terms of their suitability for this robot. The descriptive statistics are presented in Table 3. According to these results, it seems that there is no clear 'winner' on this occasion. In the pairwise comparison table, Figure A6, it is observed that just one TTS system, Espeak, is significantly different from the rest of the systems except for AT&T. These two systems are perceived as the least suitable for Mini, so, although we cannot affirm that all the TTS systems are equally suitable for this robot, there are no significant differences between the other ones (p = 1). This means that there are six TTS systems equally suitable for Mini.
In Figure 7, we can observe that, as has been said, although the preferred TTS system is Loquendo, the majority of the TTS systems obtained similar results: there are no significant differences between the TTS systems except for Espeak and AT&T, which were the worst evaluated ones.

Correlations between the Four Features Analyzed
To complete this study, we analyzed the correlations between the four features using the Pearson product-moment correlation coefficient. To do so, we first performed a preliminary analysis to verify the conditions of normality, linearity, and homoscedasticity. The test showed a strong positive correlation between three of the features: intelligibility, expressiveness, and suitability (r > 0.476, p < 0.01). Additionally, there is a negative correlation between the previous features and artificiality (r < −0.225, p < 0.01), as shown in Table 4. This means that the questions related to intelligibility, expressiveness, and suitability are directly correlated. There are two possible explanations: (i) all questions are related to the same feature or, at least, this is what participants have perceived; or (ii) there is a real relation between the analyzed features. In our opinion, the second explanation is the more likely one. Under this assumption, we can infer that, if a TTS system is perceived as intelligible, it will also be perceived as expressive and, consequently, these systems will tend to be preferred for a social robot.
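For reference, the Pearson product-moment coefficient used above can be computed as in the following minimal sketch. The ratings are invented to show one positive and one negative relation; they are not the study's data, and the function name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical ratings given by five participants to one system.
intelligibility = [5, 4, 4, 2, 1]
expressiveness = [4, 4, 3, 2, 1]
artificiality = [1, 2, 2, 4, 5]

r_positive = pearson_r(intelligibility, expressiveness)
r_negative = pearson_r(intelligibility, artificiality)
print(f"intelligibility vs expressiveness: r = {r_positive:.3f}")
print(f"intelligibility vs artificiality:  r = {r_negative:.3f}")
```

The sign of r captures the direction of the relation: positive when both ratings rise together, negative when one falls as the other rises.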

Discussion and Conclusions
In this work, we have presented a comparison of eight TTS systems considering four features: intelligibility, how clear the voice of the robot is; expressiveness, how monotonous the voice is; artificiality, how "robotic" the robot voice is; and suitability, how adequate the voice is for a robot. The first two features are usually included in these kinds of studies as the aspects to be optimized. Additionally, we have included the last two, since, in social robotics, it is important to analyze how natural and suitable for the robot the voice is perceived. The tests have been carried out after integrating these systems into three social robots.
In total, 125 participants evaluated all features for each TTS system, but each participant considered just one of the social robots. After that, we conducted a statistical analysis to see if there were significant differences in the results obtained by each TTS system. The method used was a one-way repeated measures ANOVA. Regarding RQ1, RQ2, and RQ3, the statistical analysis shows that there are differences in terms of intelligibility, expressiveness, and artificiality for the TTS systems. This allows establishing a comparison between the systems, indicating which one is the most and least intelligible, expressive, and artificial. Moreover, the analysis indicates that a direct correlation exists between the features intelligibility and expressiveness, and an inverse correlation between these and artificiality.
In general, the TTS system provided by Google is the best rated with respect to intelligibility and expressiveness, and it is also perceived as the least artificial. At the other extreme, Espeak is at the bottom of the ranking, being perceived as robotic, monotonous, and unclear.
In relation to RQ4, we observe that, although for each robot there are significant differences between the TTS systems, we cannot conclude that there is just one most suitable TTS system for each robot. In fact, there is a set of preferred TTS systems for each robot. For Maggie: Google, Ivona, Nuance, and Loquendo; for Mbot: Ivona, Microsoft, and Google; for Mini: Loquendo, Ivona, Pico, Google, Microsoft, and Nuance.
Considering the results obtained for this feature, we can make the following observations:

• For our three social robots, the most suitable TTS systems overall are Google and Ivona. In fact, Ivona has been, in all cases, the second best rated (with no significant differences from the first and the third ones). Therefore, this TTS system can be a good selection for these robots.

• In relation to the least suitable TTS systems, it is interesting to note that, for Maggie and Mini, the worst evaluated system is Espeak (it is significantly different from the most suitable one, p < 0.05). In contrast, this TTS system is not perceived as the least suitable one for Mbot. One reason could be that Maggie and Mini have more physical similarities with each other (Mini is a small version of Maggie) than with Mbot. Another reason could be related to gender. One aspect of the TTS systems that has not been considered until now is the gender of the synthesized voice. This characteristic may not seem very relevant at first, but, considering that we give names to the robots, which people can associate with the feminine or masculine gender, it must be considered when evaluating the suitability of a particular voice for a specific robot. All TTS systems have been tested using a feminine voice except for Espeak, which uses a masculine voice. According to our own experience, people tend to refer to Maggie and Mini as feminine, and to Mbot as masculine. Therefore, it is logical that Espeak is perceived as less suitable for Maggie and Mini, and not so unsuitable for Mbot.

• In general, the TTS systems that are evaluated as the most "robotic" ones (Espeak, AT&T, and Pico) are also considered the least suitable for the robots. This seems to be a contradiction, but it must be noted that these TTS systems are also the ones that were evaluated as the least clearly understood by the participants (intelligibility).

Limitations and Lessons Learned
The work presented in this paper has some limitations. First of all, the validity of the analysis might be influenced by the language used in the experiments: Spanish. Although this may not be a limitation per se, we limited the study to TTS systems that offered that specific language. Therefore, we have missed other interesting TTS systems.
Another limitation, also related to the selection process, is that price was a further criterion for choosing these eight systems. As in the previous point, this may have caused some good TTS systems (maybe better than the ones considered in this paper) to be discarded.
In relation to the suitability feature, just three social robots were used, and, moreover, they may have some resemblance to each other: all of them have a head, eyes, similar colors, etc. This fact may explain the results and conclusions obtained in RQ4: although there are some TTS systems clearly not suitable for the robots, when selecting the most suitable one, we do not have a clear winner for each robot.
It should be noted that the participants filled in the questionnaires after watching a video of the robots speaking instead of directly interacting with the robots. This choice offered an important advantage, since it allowed us to reach a larger number of participants, but we are aware that some bias may have been introduced by the lack of direct interaction and by the system used to reproduce the sounds. In particular, the quality of the voice perceived by participants could be affected by the microphones used to record the utterances, the audio codec used in the videos, the recording distance and position with respect to the robot, and the sound equipment of the participants. To mitigate the first of these factors, we made sure that the recordings were made from the same position with respect to the robots for all TTS systems and with a high-quality recording system. The sound equipment used by the participants, however, was an aspect over which we had no control.
Finally, another factor that should be taken into account is that the name of the TTS system is said in the videos. Although Google has a very good performance objectively, participants may have been influenced by the name, since it is a well-known product name (authority bias). Similarly, the evaluations could have been affected by comparison bias, since all the users evaluating the TTS systems for a given robot heard the utterances in the same order.

Funding:
The research leading to these results has received funding from the projects: "Development of social robots to help seniors with cognitive impairment (ROBSEN)", funded by the Ministerio de Economia y Competitividad; "RoboCity2030-DIH-CM", funded by Comunidad de Madrid and co-funded by Structural Funds of the EU; "Robots Sociales para estimulación física, cognitiva y afectiva de mayores (ROSES)", funded by Agencia Estatal de Investigación (AEI).

Figure A2. Pairwise comparisons for Expressiveness. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A3. Pairwise comparisons for Artificiality. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A4. Pairwise comparisons for Suitability and the robot Maggie. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A5. Pairwise comparisons for Suitability and the robot Mbot. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A6. Pairwise comparisons for Suitability and the robot Mini. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.