On Friday, I saw a write-up on using the Web Speech API for capturing and transcribing speech. I was inspired to see how this could be incorporated into Ext JS, so I started exploring.

First, I discovered that you can already add the x-webkit-speech attribute to a text field, and it will automatically create an audio-capture-ready text field in Chrome. All you have to do is hook up listeners to handle particular events. While this was promising, I found a big problem: if you try this in Canary, you’ll be greeted with a nasty deprecation warning. Apparently, Chrome will eventually ditch the input attribute support in favor of full-on use of the JavaScript API.

No matter, that’s more flexible anyway. Based on this conclusion, I dove into the API and created an Ext JS wrapper that supports interactions with the Web Speech API. You can try out an example, and grab the source on GitHub.

About the API

The API has a fair amount to it, so I’ll highlight some of the configuration options and describe the results, which can be a bit confusing.

Configuration

  • continuous: If false, you get a one-shot stab at capturing audio. Once audio is no longer detected, the capture automatically ends. If true, you can “continuously” capture audio, even if no audio is detected (such as during a pause for a breath). This conceivably allows you to record indefinitely, although Chrome apparently caps the total recordable duration at 60 seconds. So no support for novel transcription yet :)
  • interimResults: If true, the recognition service will return interim results while audio capture is still occurring. This is nice if you’d like to give visual feedback of the capture in progress. If false, only the final recognition capture result will be returned.
  • maxAlternatives: This defines the maximum number of alternatives that are returned per recognition result. Since each alternative is ranked by confidence level (see below), it’s probably not terrifically useful to return more than 1 alternative, but could be interesting from a nerd perspective to analyze additional alternative recognition results.
  • minimumConfidenceLevel: While not a part of the Chrome API itself, this local configuration variable allows you to filter out results based on a minimum confidence level. I’d suggest leaving it at 0, or 0.5 at most, in order to get the fastest feedback.
  • logFinalOnly: Another local config, this tells the wrapper whether it should process and log only results that are “final”.
  • chainTranscripts: Also a local config, this instructs the wrapper whether or not it should chain final transcript results together. For example, if you want to do 5 60-second sessions in succession, you could set chainTranscripts to true in order to end with a final, single transcript from all 5 sessions. If false, you’d be left with a transcript of the last-captured recognition session.
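To make the split between native and wrapper-local options concrete, here is a minimal sketch of how a wrapper might divide a single config object between the two. The helper name applyConfig is my own invention for illustration; only continuous, interimResults, and maxAlternatives are real properties of the native SpeechRecognition object, while the rest are local to the wrapper as described above.

```javascript
// Native Web Speech API settings vs. wrapper-local settings.
var NATIVE_KEYS = ['continuous', 'interimResults', 'maxAlternatives'];

// Copies native keys onto the recognition instance and returns
// the remaining (wrapper-local) options for the wrapper to handle.
function applyConfig(recognition, config) {
    var local = {};
    Object.keys(config).forEach(function (key) {
        if (NATIVE_KEYS.indexOf(key) !== -1) {
            recognition[key] = config[key]; // set directly on the recognition instance
        } else {
            local[key] = config[key];       // handled by the wrapper itself
        }
    });
    return local;
}

// Usage — in Chrome you'd pass a real webkitSpeechRecognition instance;
// a plain object stands in here for illustration:
var rec = {};
var localConfig = applyConfig(rec, {
    continuous: true,
    interimResults: true,
    maxAlternatives: 1,
    minimumConfidenceLevel: 0.5,
    chainTranscripts: true
});
```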

About Results

The result event is interesting because it has a lot going on. Whenever you receive a result from the recognition service, you receive the following:

  • SpeechRecognitionResultList: A list of all recognition results returned for the current capture.
  • Each list contains N SpeechRecognitionResult objects. Each SpeechRecognitionResult is either final or not (in the case of interim results).
  • Each SpeechRecognitionResult is composed of N SpeechRecognitionAlternatives, which is where the transcripts of the audio capture are stored. Each SpeechRecognitionAlternative has two properties: confidence and transcript. The confidence property indicates how sure the recognition service is that the transcript matches the captured audio. Helpfully, the recognition service returns the highest-confidence SpeechRecognitionAlternative first in the array.
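To show how that nesting plays out in code, here is a small sketch of walking the results structure the way a wrapper might: skip interim results, take alternative 0 (the highest-confidence one), and apply a minimum confidence filter. The function name extractFinalTranscripts is mine; the mock object below just mirrors the shape of the native event.results so the logic can be exercised outside the browser.

```javascript
// Walks a SpeechRecognitionResultList-shaped structure and collects
// the best transcript from each final result above a confidence floor.
function extractFinalTranscripts(results, minimumConfidenceLevel) {
    var transcripts = [];
    for (var i = 0; i < results.length; i++) {
        var result = results[i];
        if (!result.isFinal) {
            continue; // skip interim results
        }
        var best = result[0]; // highest-confidence SpeechRecognitionAlternative
        if (best.confidence >= minimumConfidenceLevel) {
            transcripts.push(best.transcript);
        }
    }
    return transcripts;
}

// Mock of event.results containing one final and one interim result:
var mockResults = [
    { 0: { transcript: 'hello world', confidence: 0.92 }, isFinal: true, length: 1 },
    { 0: { transcript: 'hello', confidence: 0.4 }, isFinal: false, length: 1 }
];
var finals = extractFinalTranscripts(mockResults, 0.5);
// finals → ['hello world']
```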

While all of this is nice to know, the Ext JS wrapper takes care of it for you. While you can configure it to handle results in some different ways (see Configuration above), out of the box you don’t need to mess with the technical details of the recognition service’s results at all.

Unless, of course, you want to. To help illuminate what’s happening, the wrapper includes some logging. Whenever a result is returned from the recognition service, a snapshot of each SpeechRecognitionAlternative is logged to a store internal to the wrapper class. If you want to see the results which have been logged to the store, simply call getResults() and you can interact with them just like you would any other Ext JS Store.

Timings

If you are interested, the wrapper also keeps track of timings for four areas of interaction with the recognition service:

  • sound: When sound is detected
  • speech: When speech is detected
  • audio: When audio capturing is occurring
  • overall: Duration between start and end events

To get the durations for these timings, simply call getSoundDuration(), getSpeechDuration(), getAudioDuration(), or getDuration(), respectively.
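As a rough illustration of how such timings could be tracked, here is a sketch that records start/end timestamps per area and computes elapsed milliseconds. The TimingTracker name and mark() method are my own; a wrapper would presumably call something like them from the native soundstart/soundend, speechstart/speechend, audiostart/audioend, and start/end event handlers.

```javascript
// Tracks start/end timestamps for named timing areas.
function TimingTracker() {
    this.timings = {};
}

// Records the current time under e.g. ('sound', 'start') or ('sound', 'end').
TimingTracker.prototype.mark = function (name, when) {
    this.timings[name] = this.timings[name] || {};
    this.timings[name][when] = Date.now();
};

// Returns elapsed milliseconds, or null if either timestamp is missing.
TimingTracker.prototype.getDuration = function (name) {
    var t = this.timings[name];
    return (t && t.start != null && t.end != null) ? t.end - t.start : null;
};

// Usage — the wrapper would call mark() from the recognition event handlers:
var tracker = new TimingTracker();
tracker.mark('sound', 'start');  // e.g. in the soundstart handler
tracker.mark('sound', 'end');    // e.g. in the soundend handler
var soundDuration = tracker.getDuration('sound'); // milliseconds
```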

Caveats

Ok, now for the caveats :)

  • Chrome only! The Web Speech API is currently only implemented in WebKit, so tough luck if you want to use it in Firefox, IE, etc.
  • The Web Speech API is still being developed, so there’s no telling if the spec will change, Chrome’s implementation will change, or both. This was really just for the fun of experimentation, so I would strongly suggest against using this for anything real, unless you are willing to support whatever changes need to be made if and when the implementation changes.
  • Grammars/Lang/ServiceURI: You’ll notice these configuration options in the wrapper. Currently, there is no support for these built in to the wrapper. They are more for the purpose of stubbing out future implementation to support these aspects of the Web Speech API.

Wrapping Up

Despite the limited implementation of the Web Speech API in current browsers, I think there are some really cool things coming that will leverage it once it has broader adoption. I hope this wrapper is interesting, if for nothing else than providing a demo of something that will be a reality in the not-too-distant future.

As always, I appreciate any constructive feedback, so please let me know what you think in the comments!