Biggest issue is that people hate talking to computers in public.
Alexa was the closest to achieving significant usage since you can use it within the privacy of your home.
For voice UIs, the unclear boundaries around what you think it can or cannot do are also a huge hurdle. After you get a couple of “sorry, I cannot do that” responses, you stop using it.
Yeah, unless the utility of these devices is large enough to override existing cultural norms, there are actually very few venues where it feels "comfortable" to voice-interact with a device.
I went through this exercise with GPT voice. It's an awesome capability, but other than perhaps walking outside, or sitting in my office, there's no other space where it feels "ok" to just spontaneously talk to something.
A grey area is when you have headphones in or on and it looks like you're in a phone conversation with somebody; then it kinda feels ok. But generally you're not going to take a phone conversation in a public area without distancing yourself from others.
There's a reason most casual communication these days is text rather than voice or video calls.
> it looks like you're in a phone conversation with somebody
Even though everyone's seen AirPods by now, on those rare occasions when I'm on the phone in public, I feel compelled to have my phone out and vaguely talk at it, so it's clear I'm on a phone call and not a crazy person.
I'm curious if we would see similar usage with the Pin, where voice commands in public are always performed with the hand up for the projection screen (it will still prompt looks, but hopefully be clear in context: "oh, they're doing some tech thing").
Of course, at this price point, it's highly dubious that we'll see anywhere near the ubiquitous market penetration of AirPods (which garner understandable complaints about their sub-$200 price point, and that's with a clear value prop).
I don't mind the earphones, but often headsets are entirely impractical, most notably in any sort of weather, wind, etc. A phone can also get rained on, but it's a bit easier to keep safe.
The other reason they're mostly impractical: keeping a charge. *Wired* headsets were great in this regard, but then there's the wire, and now there's the phone (which may not even support the wire?).
The weirdness is caused by the incantations all these things require. Once you can just talk to the AI without doing anything else, just talk to it, it'll catch on very easily.
This, fortunately, is a solved problem. Or will be, once Amazon, Apple and Google get off their asses and plug a better voice-recognition model into an LLM.
It's silly how OpenAI could blow all voice assistants out of the water today if they just added Android intents as function calls to the ChatGPT app. Yes, the "voice chat mode" is that good.
I know I'm getting close to Torment Nexus territory, but how do you get an LLM to run code as the response? Given that an LLM basically calculates the most probable text that follows a prompt, how do you go from that response to a function call that flips a light switch? It seems like you'd need some other ML/AI that takes the LLM output, figures out it most likely means a certain call to an API, and then executes that call.
With Alexa I can program if/then statements: basically, when I say X, then do Y. If something like ChatGPT requires the same thing, then I don't see the advantage.
> With Alexa I can program if/then statements: basically, when I say X, then do Y. If something like ChatGPT requires the same thing, then I don't see the advantage.
Yes, I was thinking about even something as simple as if/then, which could be configured in the UI and manifest to GPT-4 as the usual function-call stuff.
The advantage here would be twofold:
1. GPT-4 won't need you to speak a weird command language; it's quite good at understanding regular talk and turning it into structured data. It will have no problem understanding things like "oh, flip the lights in the living room and run some music, idk, maybe some Beatles", followed by "nah, too bright, tone it down a little", and reliably converting them into data you could feed to your if/else logic.
2. The ChatGPT app has a voice-recognition model that, unlike Google Assistant, Siri and Alexa, does not suck. It's the first model I've experienced that can convert my casual speech into text with 95%+ accuracy, even with lots of ambient noise.
Those are the features the ChatGPT app offers today. Right now, if they added a basic bidirectional Tasker integration (user-configurable "function calls" emitting structured data for Tasker, and the ability for Tasker to add messages into the chat), anyone could quickly DIY something 20x better than Google Assistant.
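A minimal sketch of how such a UI-configured if/then rule could surface to the model as a tool definition. This assumes OpenAI-style function calling; the rule name and parameters here are made up for illustration:

```python
# Hypothetical: turn a user-defined if/then rule into an OpenAI-style
# "tools" entry that GPT-4 could call with structured arguments.

def rule_to_tool(rule_name: str, description: str, params: dict) -> dict:
    """Build a function-calling tool definition from a user-configured rule."""
    return {
        "type": "function",
        "function": {
            "name": rule_name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }

# Example rule: "when I ask about the living room lights, call this."
lights_tool = rule_to_tool(
    "set_living_room_lights",
    "Turn the living room lights on/off and set brightness.",
    {
        "on": {"type": "boolean"},
        "brightness": {"type": "integer", "minimum": 0, "maximum": 100},
    },
)
print(lights_tool["function"]["name"])  # → set_living_room_lights
```

The model then replies with structured arguments ("nah, too bright, tone it down" becoming e.g. `{"on": true, "brightness": 30}`), which your own automation consumes.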
At some point you've got to get from language to action, yes. In my case, I use the LLM as a multi-stage classifier, mapping from a set of high-level areas of capability down to more focused mappings onto specific systems and capabilities. The first layer of classification might say something like "this interaction was about <environmental control>", where <environmental control> is one of a finite set of possible systems. The next layer might say "this is about <lighting>", and the layer after that may have enough information to interrogate using a specific enough prompt, which may itself be generated from a capability definition: for example, "determine any physical location, an action, and any inputs regarding colour or brightness from the following input", generated from the possible inputs of the capability you think you're addressing.
Of course this isn't foolproof, and there still needs to be work defining the capabilities of systems, etc. (although these are tasks AI can assist with). But it's promising: "teaching" the system how to do new things is relatively simple, and effectively akin to describing capabilities rather than programming directly.
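The layered routing described above could be sketched roughly like this. The `llm_classify` stub stands in for a real constrained LLM call, and the system/capability names are purely illustrative:

```python
# Illustrative multi-stage classifier: each layer narrows the scope, and the
# final layer generates a focused prompt from the capability definition.

SYSTEMS = {
    "environmental control": {
        "lighting": ["location", "action", "colour", "brightness"],
        "heating": ["location", "action", "temperature"],
    },
    "media": {
        "music": ["action", "artist", "volume"],
    },
}

def llm_classify(utterance: str, options: list[str]) -> str:
    # Stand-in for an LLM call constrained to one of `options`; faked here
    # with substring matching so the sketch runs without a model.
    for opt in options:
        if opt in utterance.lower():
            return opt
    return options[0]

def route(utterance: str):
    # Layer 1: which high-level area is this about?
    area = llm_classify(utterance, list(SYSTEMS))
    # Layer 2: which capability within that area?
    capability = llm_classify(utterance, list(SYSTEMS[area]))
    # Layer 3: a focused prompt generated from the capability's inputs.
    fields = SYSTEMS[area][capability]
    prompt = (f"Determine {', '.join(fields)} from the following input: "
              f"{utterance!r}")
    return area, capability, prompt

print(route("turn down the lighting in the kitchen")[:2])
# → ('environmental control', 'lighting')
```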
> If something like ChatGPT requires the same thing, then I don't see the advantage.
So LLMs today can do this a few ways. One: they can write and execute code. You can ask for some complex math (e.g. calculate the tip for this bill), and the LLM can respond with a Python program to do that math; the wrapping program can then execute it and return the result. You can scale this up a bit; use your creativity on the possibilities (e.g. SQL queries, one-off UIs, etc.).
You can also use an LLM to “craft a call to an API from <api library>”. Today, Alexa basically works by calling an API: you get a weather API, a timer API, etc., and make them all conform to the Alexa standard. An LLM can one-up it by using any existing API unchanged, as long as there's adequate documentation somewhere for the LLM.
An LLM won’t revolutionize Alexa-type use cases, but it will give them a way to reach the “long tail” of APIs and data retrieval. LLMs are pretty novel for the “write custom code to solve this unique problem” use case.
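A toy version of that write-and-execute loop, with the model's reply hard-coded. A real system would fetch the code from an API call and should sandbox execution rather than `exec()`-ing model output directly:

```python
# Toy "LLM writes code, wrapper runs it" loop for the tip-calculation example.

def fake_llm(prompt: str) -> str:
    # Pretend the model answered "calculate an 18% tip on $42.50" with code.
    return "result = round(42.50 * 0.18, 2)"

def run_llm_code(prompt: str) -> float:
    code = fake_llm(prompt)
    namespace: dict = {}
    # NB: never exec untrusted model output outside a sandbox in real use.
    exec(code, {}, namespace)
    return namespace["result"]

print(run_llm_code("calculate an 18% tip on $42.50"))  # → 7.65
```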
Yup, from where I see it, the only thing holding LLMs back from generating API calls on the fly in a voice-chat scenario is probably latency (and to a lesser degree malformed output).
Yea, the latency is absolutely killing a lot of this. Alexa's first-party APIs of course are tuned and reside in the same datacenter, so it's fast, but a west-coast US LLM trying to control a Philips Hue will discover its calls are crossing the Atlantic, which probably competes with the LLM itself for how slow it can be.
> and to a lesser degree malformed output
What's cool is that this isn't a huge issue. Most LLMs now have "grammar" controls, where the model doesn't select just any character as the next one; it selects the highest-probability character that conforms to the grammar. This dramatically helps with things like well-formed JSON (or XML, or...) output.
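A rough illustration of the idea. Here the "grammar" is just a finite set of valid JSON commands and the "model" assigns random scores; real grammar-constrained decoding works over tokens and a full grammar, but the masking principle is the same: at each step, only characters that can still lead to a valid string are considered.

```python
import json
import random

# Sketch of grammar-constrained decoding: mask the model's choices to
# characters that can still extend to a string the grammar accepts.

VALID_OUTPUTS = [
    '{"device": "lights", "state": "on"}',
    '{"device": "lights", "state": "off"}',
    '{"device": "music", "state": "on"}',
]

def model_scores(prefix: str, candidates: set) -> dict:
    # Stand-in for real next-token logits; a real model would score these.
    return {c: random.random() for c in candidates}

def constrained_decode() -> str:
    out = ""
    while True:
        # Characters the grammar allows next, given what we've emitted so far.
        allowed = {v[len(out)] for v in VALID_OUTPUTS
                   if v.startswith(out) and len(v) > len(out)}
        if not allowed:
            return out  # no valid continuation left: the string is complete
        scores = model_scores(out, allowed)
        out += max(allowed, key=scores.get)

decoded = constrained_decode()
print(json.loads(decoded)["device"])  # always parses: "lights" or "music"
```

Even with random "model" scores, the output is guaranteed to be one of the grammar's valid strings, which is why malformed JSON stops being a practical worry.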
Disagree. The extra latency of adding an LLM to a voice pipeline is not that much compared to doing voice via the cloud in the first place. Improved accuracy and handling of natural-language queries would be worth it relative to the barely-working "assistants" that people only ever use to set timers, and they can't even handle that correctly half the time.
The basic idea is to instruct the LLM to output some kind of signal in text (often a JSON blob) that describes what it should do, then have a normal program use that JSON to execute some function.
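A minimal sketch of that pattern. The function name and JSON schema are invented for illustration; the point is that ordinary code, not the model, does the dispatching:

```python
import json

# The LLM is prompted to emit a JSON "signal"; plain code dispatches on it.

def set_lights(room: str, on: bool) -> str:
    return f"lights in {room} turned {'on' if on else 'off'}"

HANDLERS = {"set_lights": set_lights}

def dispatch(llm_output: str) -> str:
    call = json.loads(llm_output)          # parse the blob the model emitted
    handler = HANDLERS[call["function"]]   # map the name to real code
    return handler(**call["arguments"])

# Pretend the model replied with this blob to "flip the living room lights":
reply = '{"function": "set_lights", "arguments": {"room": "living room", "on": true}}'
print(dispatch(reply))  # → lights in living room turned on
```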
Comparing ChatGPT's voice bot to Pi's: ChatGPT's lacks Pi's personality and zing. Pi is completely free, and I've been using it since the beginning of October. For ChatGPT's I have to pay $20, and the voice bot is lesser (its personality / tone of voice is more monotone).
It's staggering to me that Apple has not improved on the UI for "try again" or "keep trying", whether the fault is with Siri itself, or just network conditions. It seems like (relatively) low-hanging fruit, compared to the challenges of improving the engine. (I don't use any other voice assistants, no idea how well they do here.)
If I want to ask ChatGPT about something I will, and the speech-to-text is a lot faster than typing on my phone. There's no voice incantation needed, rather a button press, but people still raise their eyebrows and make me feel self-conscious. I wish I could subvocalize to it like I remember reading about in the book series Artemis Fowl.
I agree, BUT I think it's going to get a lot better soon. E.g. I loathe Siri because it felt like there was always some incantation I had to remember, like a very terrible CLI. LLMs, though, even if we never get intelligence right, I think can help this area significantly.
Combine that with areas like GPT Vision, Whisper, etc., and it'll start feeling a lot more natural here very soon, I suspect.
TBH I'm surprised Apple isn't pushing this much harder. They tout Siri so hard, but it's just worthless to me. It feels like Apple could make an AI Pin like this, but from the public side I have zero indication that they're even working in this space. It feels like they purposefully watched the boat sail away.
edit: Sidenote: Pin + AirPods would be a nice way to interface more quietly too.
Google Assistant has been years ahead of Siri and Alexa for a good while now. I've been able to give it really loose, sloppy commands, even stuttering or backtracking on my sentences, and it does a competent job of figuring out what I want. In my experience, Siri is much more dependent on keywords and certain phrasing, and doesn't integrate as deeply into one's life because Apple doesn't play Google's game of slurping up all your personal data and all the public data on the internet.
These next-gen AI voice assistants are still a solid improvement over Google's current offerings, but they'll feel like a massive jump into the future for folks who have been stuck in Apple's ecosystem, and that's probably where the biggest opportunity lies.
Agreed. I have the new Meta Ray-Ban glasses, and have been pleasantly surprised by how softly I can speak since the mics are so close to my mouth, but I still don't enjoy doing it in public.
Well, I hated talking to Siri in public because about 70% of the time it did what I wanted, and 30% of the time it made me feel like a fool for even trying. That 30% was what killed it for me after giving it a serious go around the time Apple was rolling out Shortcuts.
After watching the presentation, I am now curious about Humane's thing, but I'm still going to hold off for a bit because I want to see the failure modes first, and I also don't want to rush out and be one of the first to buy the brand-new 3Com Audrey.
The only reason people don't like talking to computers in public is that it's distinguishable in an awkward way from talking to humans in public. That's not going to be an issue for much longer. ChatGPT voice mode is about 99% of the way there. The only remaining issue is the cadence of the conversation -- you can't interrupt ChatGPT naturally, you have to press a button.
The issue is that your private communications are now audible to the people around you. It’s one thing when it’s to another person and you can whisper and share social context; it’s another when it’s at a good volume and contextless.
These don't seem like real issues to me. They are the exact same issues you have when you are talking to humans. And the way we solve that issue with humans is that we only have conversations around other humans that we are comfortable having. We save sensitive conversations for when we are not in public.
The issue isn’t “communicating with the device”; it’s “communicating around other people”.
I have almost negative interest in having to recite the technical specifics of my web search to my phone on the train to work. I have even less interest in having to listen to the person next to me trying to do the same.
Typing already allows sensitive conversations with computers in public, as long as no one is directly peeking at your screen. When I talk to humans in public, I'm not using them as a utility tool to manage information for me, because computers do a better job. These aren't comparable scenarios.
> The only reason people don't like talking to computers in public is that ...
It does not seem right to speak of a single reason; there are probably multiple. So, IMHO, it would be more productive to come up with a list and put some weights on the options if you want to dissect this matter.
IMHO one very strong factor / important reason (one that you ignore) is the social context, i.e. the reaction of others in the same physical space as you start talking out loud, seemingly unmotivated.
Humans are social animals, so the reaction of others to your actions tends to be very important to a large fraction of the population. What is acceptable in one context simply isn't in another. Also, the exact tolerances differ with the local culture (here "local" is used in the sense of "geographically/physically local").
It's not just about not annoying others here. In this case it's also about something as imprecise as "perceived self-image". Some people (I'd argue most people) dislike the sense that others perceive them as mentally unstable or rude. Most people need some kind of social acceptance for the actions they take.
One significant trait of some mental instabilities (as well as some drug-induced behavioral changes) is that those affected will spontaneously start talking in public. You probably know of Tourette syndrome, and of alcoholics rambling about, because these cases often involve quite rude and offensive verbiage and/or loud volume, but they are not the only cases.
People in general are well adept at detecting such anomalous behaviour, as it is part of our instincts trained through evolution. Also, the uncomfortable feelings that observing this type of behaviour produces will lead many to react with a "confront or escape" (aka "fight or flight") response (a stress signal), which is not beneficial to social interaction in general.
TL;DR: If you speak out in public without a very clear and socially valid reason (speaking to an object is not that) you are not only rude to others, but you also cause them stress... and you will have to face the social stigma of being perceived as insane.
I've been thinking about this recently. A colleague is participating in a group call and talking to someone I can't hear or see and that's just background noise to me, I can easily tune that out. Another person tends to vocalize his thought process sometimes and it steals my attention in a hard-to-explain unpleasant way every time.
> TL;DR: If you speak out in public without a very clear and socially valid reason (speaking to an object is not that) you are not only rude to others, but you also cause them stress... and you will have to face the social stigma of being perceived as insane.
Except... this problem is known to be trivially solvable. After all, the very act of putting a flat rectangle to your ear makes talking out loud in public not just perfectly acceptable, but mundane and not worth paying attention to (subject to social norms dictating where it is or isn't OK to be on the phone).
As for talking to yourself signalling insanity... I'd hope that stupid and probably developmentally retarding idea died long ago, with the "talking to yourself out loud in public" subtrope having been dead since wireless earphones got ubiquitous some two decades ago.
The modern reality is, hearing someone "talking to themselves" is normal, and 99.9% of times means they're on a call.
The point is, it's not, though. As a society we have generally established that it is rude to speak out loud on the phone in public: especially on the bus or the train, while waiting for either, in the shop, at a movie, or any number of other places. I genuinely think it would be easier to list the places where it would be okay (on a busy street, if you step to one side). Even in those places there is some expectation that you show a little shame about doing it, as though you didn't want to but had to because the call is important.
I just find voice control too outward to use in public. I don't want people to know what I'm doing, even if it's something totally innocent, plus it would also be super annoying to be on a train full of people going "blah blah blah" to their devices.
If we could subvocalise with throat or other microphones/bone speaker then maaaybe, but I feel like it's better left to a brain interface and we should really just stick to touchscreen/typing interaction for now.
Same with SwiftKey: it can handle whispered speech to some extent.
Still, I would guess Meta glasses or AirPods should be better at handling such a whispered mode since the microphones are so much closer. It would be interesting if AirPods had some contact mic that could pick up whispered sound inside your mouth.
Maybe the holy grail is to have something inside your mouth so you don't even have to voice anything: the device will figure out what you want to say from mouth and tongue movement. Smart tooth braces, anyone? :)
If people can speak more naturally, maybe they'll be okay with it. I am constantly encountering people who are laughing or talking to themselves out in public nowadays. Of course, they're probably on phone calls with Airpods in, but it doesn't seem to be awkward in a way it used to in the 'Bluetooth headset' days.
This is easy to fix, IMHO. Pair a small screen in the future for typing, or have a cufflink mic for whispering. You will see accessories like these pop up in the near future.
The can-and-cannot-do problem reminds me of writing AppleScript. I just want to call a function, not figure out where to sprinkle in random a/the/of modifiers!
How well does whispering do with these things? I've found that I can reliably write sentences and set alerts when holding the mic fairly close on my Pixel 6.
Believe it or not, here in Ottawa, Canada, I was just reading a post on Reddit where people complained about those who were talking or doing video calls on the street.
I think this will be a matter of culture, and the barrier will shrink as soon as the devices are "smarter" and stop making you repeat yourself many times or misunderstanding what you are asking.