These days it seems everybody in the Bay Area is talking about voice — in the past week or so we’ve attended multiple events on topics including intelligent assistants, Voice UI, and conversational commerce. Here we’ll look at some concrete use cases and proof-points that are emerging from this growing conversation, involving both text-based messenger platforms and voice assistants housed in so-called smart speakers.
Right now, text looks like the Conversational Workhorse
At the recent Opus Intelligent Assistants conference (#IAConf), several case studies suggested that text-based assistants, in a variety of formats, deliver impressive results. Digibank from DBS, a huge player in Asian financial services, has scored in markets like India and Indonesia with a mobile-first replacement for brick-and-mortar banking, featuring a “24×7 virtual assistant that uses state-of-the-art artificial intelligence”. As DBS Managing Director Tom McCabe showed, automation levels for text queries to the platform are at 80%+. This level of automation can only be achieved with an omnichannel approach to conversation: the digibank bot runs inside the banking mobile app, as well as on the website and, of course, Facebook Messenger.
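For a sense of what that omnichannel plumbing implies, here is a minimal sketch of our own (not DBS’s actual architecture; the channel names, intents, and handler are invented): thin per-channel adapters normalize traffic into one shared handler, so the same NLU and answers serve every surface and improve together.

```python
# Hypothetical sketch of an omnichannel bot: adapters for the mobile app,
# website, and Messenger all feed one shared handler, so intent handling
# is trained and improved in a single place.
from dataclasses import dataclass

@dataclass
class Message:
    channel: str   # "app" | "web" | "messenger"
    user_id: str
    text: str

# Toy answer table standing in for a real NLU model and answer store.
ANSWERS = {
    "balance": "Your balance is available under Accounts.",
    "card_block": "I've blocked your card. A replacement is on the way.",
}

def classify_intent(text: str) -> str:
    """Stand-in for a trained intent classifier."""
    lowered = text.lower()
    if "balance" in lowered:
        return "balance"
    if "card" in lowered:
        return "card_block"
    return "unknown"

def handle(msg: Message) -> str:
    """One brain for every channel; unknowns fall back to a human agent."""
    intent = classify_intent(msg.text)
    return ANSWERS.get(intent, "Let me connect you with an agent.")

# A thin adapter only translates the channel's payload into the shared shape.
def from_messenger(event: dict) -> str:
    return handle(Message("messenger", event["sender"]["id"], event["message"]["text"]))

print(handle(Message("app", "u1", "What's my balance?")))
```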
The 80% threshold also came up in another text-based bot implementation, an e-gov case study from Australia. But it gets better: in both the digibank and the e-gov instances, NPS scores were in the high 80s, a level also hit by an SMS bot at a Radisson hotel property in London. In the talk by InterContinental Hotels Group, the IT director noted another proof-point: employee satisfaction was higher as well. It turns out that when customers are happy, and not asking the staff dumb questions, the staff can be more attentive and more helpful. This linkage between customer and employee satisfaction has been an abiding interest of ours for some time now.
Bots and Humans learn together
For both Text and Voice, another recurring thread is training, which any machine learning scientist will tell you is the key to success. But hey, as Tobias Goebel, a long-time call center strategist from Aspect, pointed out, human agents need training too, so let’s treat this holistically. Indeed, one of the best practices from the digibank experience was building a curriculum for human agents from learnings in the automated channel (new-question reports, key skills).
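As a rough illustration of that practice (the log fields here are invented, not digibank’s), you can mine the automated channel for the questions the bot could not handle and let their frequency order the human curriculum:

```python
# Hypothetical sketch: build a human-agent curriculum from bot logs.
# The questions the bot could NOT answer are exactly the ones human
# agents will see most, so they lead the training plan.
from collections import Counter

bot_logs = [
    {"question": "how do I reset my PIN", "handled": False},
    {"question": "how do I reset my PIN", "handled": False},
    {"question": "what is my balance", "handled": True},
    {"question": "can I open a joint account", "handled": False},
]

new_questions = Counter(
    log["question"] for log in bot_logs if not log["handled"]
)

for question, count in new_questions.most_common(10):
    print(f"train agents on: {question!r} (seen {count}x in automated channel)")
```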
Veteran voice UI designer Cathy Pearl spoke about interaction design probes she’s been conducting across both Amazon Alexa and Google Home. These can be simple and empirical, such as “what’s my husband’s name?” Responses range from the contextual-but-inquisitive (“you haven’t told me that.”) to the playful (“Frankly, I’m concerned you have to ask me that.”) Cathy’s point is that we are on the very cusp of these conversational interfaces becoming learning frameworks for transferring knowledge about ourselves to the machine.
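One way to picture that transfer (purely our sketch, not how Alexa or Google Home are actually built) is a per-user fact store the assistant can both query and fill, with Cathy’s “you haven’t told me that” as the empty-slot case:

```python
# Toy sketch (ours) of an assistant "learning" personal facts.
# An unknown slot triggers the contextual-but-inquisitive response
# Cathy observed, which doubles as a prompt to fill the slot.
facts = {}  # user_id -> {slot: value}

def ask(user_id, slot):
    value = facts.get(user_id, {}).get(slot)
    if value is None:
        return f"You haven't told me that. What is your {slot}?"
    return f"Your {slot} is {value}."

def tell(user_id, slot, value):
    facts.setdefault(user_id, {})[slot] = value

print(ask("u1", "husband's name"))   # You haven't told me that. ...
tell("u1", "husband's name", "Sam")
print(ask("u1", "husband's name"))   # Your husband's name is Sam.
```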
Voice is crossing the Uncanny Valley fast, but context and prosody remain a problem
Cathy Pearl’s tenure in the industry is long enough to span the pre-neural, Hidden Markov approaches to natural language understanding, and she retains the more hand-tooled design patterns of that era, such as directed dialog and disambiguation.
But in Cathy’s estimation of Alexa and Google Assistant, the limitations of the underlying speech understanding (and, to be fair, this is true of most text instances as well) lie in their difficulty maintaining state. This may be fine for where we are: the command-driven model of Alexa is great for playing cat sounds, or ordering cat food. But the larger context promises (in our cat example) an overall feline management program that can be trained to hear a cat’s meow as a wake word.
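To make the state problem concrete, here is a toy contrast of our own devising: a stateless command model answers each utterance in isolation, while even a single slot of carried-over context lets a follow-up like “order more of it” resolve.

```python
# Toy illustration (ours) of why maintaining state matters.

def stateless(utterance):
    """A command model: every utterance is handled in isolation."""
    if "cat sounds" in utterance:
        return "Playing cat sounds."
    if "cat food" in utterance:
        return "Ordering cat food."
    return "Sorry, I don't understand."   # "order more of it" fails here

class StatefulAssistant:
    def __init__(self):
        self.last_item = None  # one slot of conversational state

    def handle(self, utterance):
        if "cat food" in utterance:
            self.last_item = "cat food"
            return "Ordering cat food."
        if "more of it" in utterance and self.last_item:
            return f"Ordering more {self.last_item}."  # context carries over
        return "Sorry, I don't understand."

bot = StatefulAssistant()
print(bot.handle("order cat food"))    # Ordering cat food.
print(bot.handle("order more of it"))  # Ordering more cat food.
print(stateless("order more of it"))   # Sorry, I don't understand.
```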
As for the Uncanny Valley, the zone of creepiness induced by tech that is not-quite but too-close-to-human, hopefully we’ve moved on from teasing Siri. There are two updates here. First: prosody is still tough to do, but it is increasingly central to a natural-sounding dialog that can elicit information gracefully. To come back to Cathy’s spousal example, if Alexa continued and asked “by the way, what is your husband’s name?” with the right upswing in pitch, we’d be more likely to supply that critical piece of data. Second: no- and low-UI voice interfaces are out there, and some of them are 800-pound gorillas waiting in the shadows. Consider this fact: Comcast’s voice-activated remote handles 1 billion utterances every 90 days. No prosody needed there. And just wait till next-generation in-car voice assistants kick in.
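For the curious, that upswing is expressible today: both Alexa and Google Assistant accept SSML, whose prosody tag can raise pitch on a follow-up question. A minimal sketch, with the caveat that the exact pitch values each engine honors vary, so treat “+15%” as illustrative:

```python
# Sketch of marking up a follow-up question with rising pitch via SSML.
def follow_up_with_upswing(question):
    return (
        "<speak>"
        "You haven't told me that. "
        f'<prosody pitch="+15%">{question}</prosody>'
        "</speak>"
    )

print(follow_up_with_upswing("By the way, what is your husband's name?"))
```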
Notes on a work/life balance for bots
Anybody with a similar interest in this space is accustomed to shuttling between text and voice as modalities for conversational-fill-in-the-blank. But we’re wondering if there is another analytic filter we can add: Work/Play. After all, any platform that can deliver cats meowing as a third-party developer contribution is definitely extending into the Play spectrum. At a recent SF Interaction Design meetup hosted here at Orange, one demo was a game called TSA, which satirized, in an SNL-like way, the nightmare airport check-in experience, proving once again that new technology proceeds through games (think Nvidia) and entertainment (think Kazaa). But Life is important too; remember those 300M monthly voice searches on Comcast TV remotes.
In the draft schema below, we classify text-based conversational marketing campaigns as Work, such as Nike’s Breakfast Club messenger campaign that reached out to high-achieving high-school athletes. Whether it’s text (mostly, these days) or voice (coming soon) promotions, the fact that brands can convert through these channels makes assistants and bots hard-working influencers.
One word of caution as you tweak this model for your own devices: don’t dismiss the Games & Distractions moniker too quickly. Snap wasn’t a thing six years ago, and now AR is shipping as part of our phones. It’s likely the existing multimodal experience for voice search on SoundHound or Google Now or Cortana is just the tip of the iceberg. When facial, retinal, and other biometrics get into the loop of authenticated real-time access to our personal data, ambient as well as personal assistants will really start to become part of our lives in unanticipated ways.
Conclusion: the conversation is starting to get interesting
We say thank goodness we are past the hype phase for chatbots. Conversational commerce is, right now, the most commercially viable aspect of a much larger conversational paradigm, one that is multimodal, is already benefiting from AI investments, and is neither platform- nor hardware-constrained. The result will involve humans in new relations to virtual intelligences at work and at play.