Voice Is the New OS: Getting Ready for the AI-First World - Part 1

The history of technology is the history of human interaction with machines.

Millions of years ago, we started with sound (voice); much later came the word (text).

90% of all human communication still happens through voice.

The keyboard trumped the punch card. The mouse coexisted with the keyboard. The touchscreen made the mechanical keyboard redundant. Yet 90% of human communication still happens through voice, because voice is natural. Unfortunately, technology took decades to catch up with it.

There were several attempts at building the perfect voice machine. In the 1960s, IBM unveiled an early voice recognition system called Shoebox. The machine could do simple math in response to voice commands, recognizing just 16 words.

Shoebox, what’s five plus three plus eight? – Photo: IBM

In the ‘80s, the vocabulary of IBM’s speech recognition system expanded from 5,000 to 20,000 words. However, the experience still felt as awkward as it did for one Montgomery Scott of the USS Enterprise.

PART-I: Rapid Advancement in ASR Accuracy Levels

Automatic Speech Recognition (ASR) is the ability of a machine to recognize spoken words. For decades, progress was slow because ASR accuracy lagged far behind human-level accuracy. Then, around 2010, ASR accuracy reached an inflection point.

The chart above shows that during 2010–15, the advancements in speech recognition accuracy surpassed everything that happened in the prior 30 to 40 years. In fact, we are now at the threshold where machine ASR will surpass human speech recognition.

Human-like ASR accuracy is achievable now for the first time since the dawn of AI.
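ASR accuracy is conventionally measured as word error rate (WER): the edit distance between the recognized word sequence and a reference transcript, divided by the reference length. A minimal sketch of that calculation (illustrative code, not any vendor's benchmark implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five: WER = 0.2 (80% word accuracy).
print(word_error_rate("what is five plus three", "what is five plus tree"))  # 0.2
```

Human transcribers score roughly a few percent WER on conversational speech, which is the bar machine ASR is now approaching.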

Improving accuracy led to the first wave (2011–2013) of voice-based assistants, including Siri, Google Now, S Voice, and Nina, which were still limited in both speech recognition and Natural Language Understanding (NLU).

Like many other new applications, these assistants initially failed to live up to users’ expectations, often delivering worthless search results. However, as the technology improved and users learned how to talk to their devices, voice-driven offerings multiplied, leading to a massive shift in user behavior.

Massive shift in user behavior: Huge spike in voice search and related commands

The ‘voice-first’ trend is even more evident on some Asian platforms, e.g., Baidu. Just walk down any busy street in Shenzhen, China, and you’ll notice people using their phones like walkie-talkies. It’s easier and faster to do a voice search or leave a quick voice message than it is to type out a Chinese text message on a small mobile keyboard.

Things that were very challenging and expensive five years ago are now becoming mainstream. In the last two years alone, there has been a huge spike in people using voice to access content and navigate their devices. The cause of this shift is very simple: human nature.

Humans are innately tuned to converse with others. It’s how we share knowledge and emotions, and how we organize ourselves. Voice has been part of our makeup for hundreds of thousands of years.

Voice is the Most Efficient Form of Computing Input

Source: KPCB

The global ubiquity of smartphones has given us a world of information – and answers – at our fingertips. But why take the time to type a question into Google when we can ask out loud and have an answer in seconds?

Voice interfaces allow users to ask questions, receive answers, and even accomplish complex tasks in both the digital and physical worlds through natural dialogue.

The above shift combined with rapid technological advances has led to a massive surge in voice-based intelligent assistants and platforms. I wrote about it here.

Voice is the Next Frontier, and It is Getting Crowded

In Wave I (2011–13) of voice-based intelligent assistants, we saw Voicebox, Siri, Google Now, S Voice, and Nina; in Wave II (2014 to the present), we saw the launch of Microsoft’s Cortana, Amazon’s Echo, Hound, Google Home, VIV, Facebook’s M, and Jibo.

All this progress and rapid adoption leads to a whole new set of open questions. Is voice the next frontier? What role does design play in creating one of these experiences? And, most importantly, what happens when our devices finally begin to understand us better than we understand ourselves?

Remember, thus far, all human progress has been about how humans interact with machines.

Before we try to answer that, let’s refer to an important thread in technology history – The Transfer of Knowledge.

Transfer of Knowledge and Machine-to-Machine Communication

The transfer of knowledge, thus far, has evolved in three paradigms – human-to-human (past), human-to-machine (present), and machine-to-machine (future). For the first time in our history, the new transfer of knowledge will not involve humans.

Advancements in the Internet of Things (IoT), artificial intelligence (AI) and robotics are ensuring that the new transfer of knowledge and skills won’t be to humans at all. It will be direct, machine-to-machine (M2M).

M2M refers to technology that enables networked devices to exchange information and perform actions without manual assistance from humans: for example, an Amazon Echo ordering an Uber or placing an order on Amazon, all with a simple voice command.
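Under the hood, platforms like this typically turn the recognized speech into a structured intent plus parameters ("slots"), then dispatch it to the right device or service. A minimal sketch of that dispatch pattern, with hypothetical intent names and handlers (not the actual Alexa Voice Service API):

```python
# Hypothetical intent-dispatch sketch: a voice platform converts speech into a
# structured intent with slots, then routes it to a handler that talks to
# another machine (ride-hailing service, e-commerce backend, etc.).
from typing import Callable, Dict

def order_ride(slots: Dict[str, str]) -> str:
    # In a real system this would call a ride-hailing API (M2M).
    return f"Ordering a ride to {slots.get('destination', 'home')}"

def reorder_item(slots: Dict[str, str]) -> str:
    # In a real system this would place an order with a commerce backend.
    return f"Reordering {slots.get('item', 'your last purchase')}"

# Registry mapping intent names to handlers.
HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "OrderRideIntent": order_ride,
    "ReorderItemIntent": reorder_item,
}

def handle_intent(intent: str, slots: Dict[str, str]) -> str:
    handler = HANDLERS.get(intent)
    return handler(slots) if handler else "Sorry, I can't do that yet."

print(handle_intent("OrderRideIntent", {"destination": "the airport"}))
# Ordering a ride to the airport
```

The key point is that once speech becomes a structured intent, the rest of the flow is pure machine-to-machine communication; the human only speaks the initial command.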

We’re entering the age of intelligent applications, which are purpose-built and informed by contextual signals like location, hardware sensors, previous usage, and predictive computation. They respond to us when we need them to and represent a new way of interacting with our devices.

Connected Devices are Changing Everything

Voice bots like Alexa, Siri, Cortana, and Google Now embed search-like capabilities directly into the operating system. By 2020, it is expected that more than 200 billion searches will be conducted via voice. In four years, there will be 3.5 billion computing devices with microphones, and fewer than 5% will have keyboards, impacting homes, cars, commerce, banking, education, and every other sphere of human interaction.

Amazon Echo is a voice-controlled device that reads audio books and news, plays music, answers questions, reports traffic and weather, gives info on local businesses, provides sports scores and schedules, controls lights, switches, and thermostats, orders an Uber or Domino’s, and more, all using the Alexa Voice Service.

Other intelligent platforms, including Google Now and Microsoft’s Cortana, can easily help organize your life (managing your calendar, tracking packages, checking upcoming flights) like a virtual personal assistant.

Our increasing dependence on voice search and organizational platforms has accelerated the rise of machine-to-machine communication, and this will have a serious impact on the future of commerce, payments, and home devices.

It feels like we are currently at a crossroads for search. With the growth of messaging apps, some argue that the future of search and commerce will sit within messaging apps, with chatbots as personal assistants (for example, Facebook M and Google Assistant). Whether it is voice control, chatbots, or both, search and commerce will certainly not be the same in the next five years.