Let me start off by saying I absolutely enjoyed trying Say2Play with the 7-Day Trial. Let me follow that up by saying you will most definitely NOT see me buying this product, for the following two reasons:
- I'm British. Well, actually, I'm Dutch, but I'm a university student of English English, so my pronunciation is comparable to British RP. Opening Bag 4 is a hassle, as I have to pronounce the R at the end of the word in order for the VR software to pick up my command properly. Closing a panel usually only works because its command is the same as the command for opening that panel, except with "close" in front of it, and because opening and closing a panel require the same keyboard input. Apparently, I have severe difficulties pronouncing the particular American vowel unless I really focus on its pronunciation (which does not come naturally at all, I might add). Concluding my session with a hideous "largawt" instead of the proper "logout" was nigh blasphemous.
- The elongated glottal stops required between separate words are very unnatural and require too much thought and focus on the pronunciation. "Close skills" should never have to be uttered with two /s/ sounds separated by a glottal stop.
Incidentally, I am also a university student of Linguistics, who might just have a solution to the aforementioned problems. It's sort of like two birds, one stone and it's -- as far as I'm aware -- a very manageable solution. Hear me out, if you will:
As my commands are only recognised when pronounced in American English, there is obviously some underlying form of phoneme recognition present. For that, there also has to be phoneme segmentation, so I'm assuming that's in there as well. If there is phoneme segmentation, there is very likely also a quick analysis of F0, F1, and F2, in order to determine phoneme boundaries. This analysis is then probably compared with a predefined set of possible phonemes, words, or even full phrases, due to the fact that English spelling is by no means a solid indication of its pronunciation yet I can make new phrases and even words.
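To make that guess a bit more concrete, here is a minimal Python sketch of what "compare the analysis with a predefined set" could look like. Everything in it is my assumption, not Say2Play's actual code: the names `TEMPLATES`, `distance`, and `recognise` are made up, each command is reduced to a short track of (F1, F2) pairs, and the formant values are round placeholder numbers rather than measurements.

```python
import math

# Hypothetical predefined set: command -> one (F1, F2) pair in Hz per vowel.
# The numbers are illustrative placeholders, not real measurements.
TEMPLATES = {
    "logout": [(700.0, 1100.0), (600.0, 1400.0)],
    "bag four": [(750.0, 1700.0), (500.0, 900.0)],
}


def distance(track, template):
    """Mean Euclidean distance between two formant tracks of equal length."""
    if len(track) != len(template):
        return math.inf
    return sum(math.dist(p, q) for p, q in zip(track, template)) / len(track)


def recognise(track, templates, max_distance=150.0):
    """Return the closest command, or None if nothing is close enough."""
    best = min(templates, key=lambda name: distance(track, templates[name]))
    if distance(track, templates[best]) <= max_distance:
        return best
    return None
```

With a fixed set like this, a speaker whose vowels sit far from the stored values simply falls outside `max_distance`, which is what being "misunderstood" would amount to.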
With this in mind, it shouldn't be all that hard to adjust the system to an English English phoneme set. But why stop there? At some point, some speaker of Kentish, Manx, Spanish English, Afro-American English, or Texan comes along and runs into the same problem.
Solution: Voice recording. Let's say that, whenever a user wants to, he can not only test the list of commands (as is currently possible), but his tests also influence the predefined set of phonemes, words, and phrases. As phoneme recognition, phoneme segmentation, and F0 to F2 analyses are present anyway, why do we not use them to create new templates to describe the user's pronunciation (or perhaps even only where they notably diverge from what's already there) and replace the old templates? This will solve both problems described above, as (1) British or American (or any other dialect) makes no difference to recognition anymore due to the fact that the British can simply save their pronunciation, whereas the Americans can keep theirs, too, and (2) separate words can simply be saved as phrases. Why, after all, would voice recognition software that does not intend to reproduce spoken words into speech have to be able to separate spoken words? Treat "Bag 5" as a single word, /bAgfaiv/ instead of /bAg ? faiv/, and there shouldn't be any problems with the command being 'misunderstood' by the software, while preserving a more natural phrase (rather than a combination of separated words).
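For what it's worth, the idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class `CommandTemplates`, its methods, the thresholds, and the (F1, F2) numbers are all hypothetical, and I have no idea how Say2Play actually stores its templates. A user recording replaces the default only where it notably diverges, and a whole phrase is stored as one unit with no word boundary inside it.

```python
import math


def distance(track, template):
    """Mean Euclidean distance between two formant tracks of equal length."""
    if len(track) != len(template):
        return math.inf
    return sum(math.dist(p, q) for p, q in zip(track, template)) / len(track)


class CommandTemplates:
    """Default templates plus per-user overrides, keyed by whole phrase."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def enrol(self, phrase, track, min_divergence=50.0):
        """Save the user's recording if it notably differs from the default.

        Returns True if a user template was stored, False if the default
        was already close enough (threshold is an arbitrary assumption).
        """
        default = self._defaults.get(phrase)
        if default is None or distance(track, default) >= min_divergence:
            self._user[phrase] = list(track)
            return True
        return False

    def recognise(self, track, max_distance=150.0):
        """Match against the user's templates first, then the defaults."""
        active = {**self._defaults, **self._user}
        best = min(active, key=lambda name: distance(track, active[name]))
        return best if distance(track, active[best]) <= max_distance else None


# Placeholder numbers: an "American" default and a diverging user recording.
store = CommandTemplates({"logout": [(700.0, 1100.0), (600.0, 1400.0)]})
my_logout = [(550.0, 900.0), (480.0, 1000.0)]
before = store.recognise(my_logout)       # too far from the default
stored = store.enrol("logout", my_logout)
after = store.recognise([(555.0, 905.0), (475.0, 1005.0)])
```

Here the American speaker never enrols anything and keeps the defaults, while my own "logout" replaces the stored one, and "bag five" can be enrolled as a single track rather than two words with a pause between them.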
I'm a definite buyer, by the way, provided I'm not forced into American English pronunciation. It's not that I don't like you; it's just that I like John Cleese better.