Omi, AI Assistants, and the Beauty of Language
Omi is a tool that records conversations. That’s it.
However, the developers have added more. It integrates with AI to provide transcriptions, summaries, action items, and everything else.
Most of this blogpost was me ranting into this tiny little fucking necklace and seeing what it looks like, natively, when transcribed by the device. I’ll put that at the bottom. But AI shouldn’t replace us, and shouldn’t replace human memory. So I’ll type out my genuine thoughts here.
I got one of these things. Here’s why.
Recording People
First off, recording people sits in a weird, ethically gray space filled with discomfort and intrusiveness and these conglomerations of what we’ve conceptualized privacy to be.
But it feels icky to me to record people without telling them.
So I defined my boundaries for when to use this.
- Meetings where recording is expected: Team meetings, client calls, interviews where everyone knows it’s being recorded.
- Legal/High Risk Situations: Any situation where I feel like my life, safety, or legal standing could be at risk, the recorder is going on. But this is no different from me recording on my phone. This is no different from me pulling out a pen recorder and just recording the situation. If I hit this point, people’s comfort no longer matters to me.
  My safety first, other people’s safety second, and then we can make a win-win situation on comfort after those are established.
  As a side note, if you don’t do this, we should probably have a chat about being a martyr (martyrs are fucking assholes) and why you should put your oxygen mask on first before helping someone else with theirs. Any other decision is ultimately unsustainable.
- Rants, Reminders, and Conversations with Myself: This is for when I’m talking to myself, when I’m speaking at no one but me, myself, I, and the 9 other voices in my head. It’s also for reminders and high level overviews of my thoughts. This includes to-do tasks, where I can press a little button and say “hey remind me to do this thing.”
  And then it goes into my action items, which is really neat for me. I have long-covid and a lot of neurospicy flavours that lead to memory gaps and other not-so-fun limitations. I have been able to overcome a lot of these with reminders, texts, calendars, and liberal use of tools. It’s no small thing to say that this new tool may help me a lot.
So that’s fundamentally how I feel about recording people.
The AI Aspect
The transformer architecture was originally designed for language-to-language translation. That extends naturally to verbal-to-text and text-to-verbal: speech to text, and text to speech. This is really, really nice.
I use it all the time when I’m driving and I get texts. I use my voice to text someone back, because I don’t want to start typing and take my concentration away from monstrous, several-ton vehicles that would end my life in less than a second if some other human made a stupid decision. I can talk into it as if I was calling someone, and it doesn’t take nearly as much attention.
The Perspective of Omniscience
A lot of people feel like AI is this weird, like, singular pane of presence that knows everything. But LLMs don’t. Large Language Models fabricate and lie and hallucinate with utmost confidence, because all that LLMs are, in the end, are probability engines that look at what words are likely to come next based on the input they’ve been given (along with previous words, context, and a bunch of other stuff). But all they do is generate probabilities.
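To make “probability engine” concrete, here’s a toy sketch in plain Python with made-up numbers (nothing close to a real model): the only thing the model actually produces is a probability distribution over possible next tokens, and generation is just sampling from it.

```python
import math
import random

# Toy stand-in for an LLM's final layer: made-up scores ("logits")
# for what token comes after "The capital of France is".
logits = {"Paris": 8.1, "Lyon": 3.2, "pizza": 0.4, "purple": -1.0}

# Softmax turns raw scores into probabilities that sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>8}: {p:.3f}")

# Generation is just repeatedly sampling from this distribution.
# Nothing here "knows" anything; unlikely tokens still get picked sometimes,
# which is one (very simplified) way to think about hallucination.
choice = random.choices(list(probs), weights=probs.values())[0]
print("sampled next token:", choice)
```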
And if you ask an LLM to generate and create and synthesize, there will be hiccups.
For me, the usage of AI falls along the core tenets of translation, because that’s what the transformer algorithm was designed to do. The algorithmic process is designed to do language-to-language translation. You can take that and extend the concept to speech-to-text and text-to-speech. Fundamentally, this is still language to language. Spoken english and written english are two entirely different languages. Spoken english has emotion and physical gestures and movement and tone and all of these wonderful things we add to speaking, such that the words are, at most, 10% of the entire conversation.
But written english is different. To you, reader, my emotion must be conveyed through imagery, through the expression of emotion with respect to the logical barrier of disconnection from you. I am standing on the other side of your foggy mirror, frantically trying to express to you everything I feel, and all you can see are the brief moments where my hand or face touches the glass. The dimensionality behind who I am and what I feel is lost in translation. I must communicate to be understood through words and words alone.
Notice the imagery there? If I am gifted as a writer, I-
As a side note, halfway through my writing, I had to stop copilot from continuously suggesting and completing my next sentence. It kept pulling me away from the thoughts that I wanted to convey. I write my blogposts in VS Code like a freak and use markdown, and it was so distracting. By the way, if you want to know how to do this, just click the little copilot icon in the bottom right and press "snooze completions for 5 minutes" or whatever the button actually says. I did that and extended it to 40 minutes. So let's see if copilot distracts me again later.
And now you've been distracted from the message I wanted to convey to you.
Notice that.
Know that AI is not your saviour.
So let's try again.
Foggy mirror, right? I’m on one side, and you’re on the other. Two different dimensions, if you will, communicating perspectives to one another.
If I am gifted as a writer, I can turn that flat, 2 dimensional pane into a facsimile of a 3D space where I convey the gravity and depth and nuance of my thoughts to you, the reader, in a way that the emotion feels 3-dimensional. You experience it as if you were there. You empathize not with the physical space that I’m in (which movies let you experience, even more so with 3D goggles), but rather with the emotional space I’m in. Language is beautiful. And language is still language.
Where I believe models fall short is when you ask them to Create. Not logical creations, for AI is good at fixing code and writing stuff and building front-end UIs.
Rather, emotional creations. It can conglomerate the emotional creations of hundreds of people into a shared thought. Maybe AI can be used to represent the collective conscious (and maybe unconscious) of the humans that exist on the internet, which is a neat idea and an entire blogpost in itself.
But so far, truly natural and authentic creation is beyond LLMs. There’s another entire blogpost in there about whether humans are the same way, whether we are just making everything based on what we experience in nature and therefore there is no original thought, and all AI is doing is what we all do, but better.
All of this to say, AI has shortfalls. If you want a tl;dr (too long, didn’t read) of this section, here you go:
tl;dr
For me, the core tenet is translation. Language to language. Verbal to text. Text to verbal.
Many words to fewer words (summarization) is not efficient, but can be done fairly well. Omi does that.
Fewer words to more words (chatgpt style, inference and response) is the least efficient, and it’s where I think we’re messing up by going deep into building that aspect.
In my opinion, the failure rates go up significantly when you start asking LLMs to generate and create things from inferences. When you go from less, to more. That’s the space of failure for probability. You can’t predict what you don’t know 100% of the time (yet). But you can look back at what you already do know, and break it apart. Foresight is hard, hindsight is less hard.
Summarization and compacting is, in a lot of ways, destruction. And while time consuming, it can be done by anyone. Creation? Totally different concept.
So what do you wanna do, Kiwi?
Same thing as we do every blogpost, dear reader. Try to take over the - no wait, I’m not supposed to tell people that!
Seriously though, I want 3 things.
- I want to explore the device and use it and see where and how and what it does.
- I want to make myself efficient and stay me, for whatever level of me that exists.
- I want to replace every aspect of the online interaction that Omi does with a local one.
What does that look like?
Remember how I typed earlier that the words are, at most, 10% of the entire conversation one person has with another? Yeah, this device is gonna miss a lot.
And that’s fine.
I don’t want it to replace anything. I want it to give an overview, just enough, for me to remember what was said, and then I can fill in the blanks. I can fill in the emotion and movement and tone and nuance. Besides, part of the beauty of human memory is that it’s flawed. We forget things. We misremember things. We change things. I just don’t want to forget the important things.
So here’s my thoughts.
I get to build a local transcription engine. I want to minimize overhead, so llama.cpp (yay!) instead of Ollama (ew!). I get to figure out models to run locally for:
- Voice Recognition: I’m thinking Whisper-Diarization or Whisper.cpp.
- Voice Isolation: Same as above.
- Speech to Text Transcription: Probably Canary-Qwen-2.5b, parakeet-tdt-0.6b-v2, or Whisper 3 Turbo. If I use Whisper 3 Turbo, I can use Whisper.cpp too. If I want just english, Parakeet. If I want more than english, Whisper. Also depends on what device I’m dropping this on. More compute means more power, which means I can run bigger models. But it also means more power consumption. And do I really need all of that for my intended use case? Probably not. (Rough sketch after this list.)
- Language Models for Comprehension and Summarization: Probably whatever the best model on the huggingface leaderboard is. Reddit and other places recommend IBM Granite 3.2 8b Instruct and some mistral stuff. There’s a lot to play around with here. (Sketch after this list.)
- Vector Databases to store and retrieve conversations: UGH, I have no idea. I could use pinecone with text-embedding-3-large, but then I’m making API calls to openAI. I could use sentence-transformers/all-MiniLM-L6-v2 or all-mpnet-base-v2. Not sure yet. (Sketch after this list.)
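Here are those rough sketches. I haven’t built any of this yet, so treat them as guesses at the shape of each piece, not the final thing. First, speech to text. This uses faster-whisper in Python as a stand-in (the Whisper.cpp route would go through its own CLI or bindings instead); the model name and audio file are placeholders.

```python
from faster_whisper import WhisperModel  # pip install faster-whisper

# "large-v3" is a placeholder; smaller models trade accuracy for less compute.
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

# transcribe() returns lazy segments plus info about the detected language.
segments, info = model.transcribe("rant.wav", vad_filter=True)

print(f"detected language: {info.language} ({info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")
```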
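Then summarization, assuming llama.cpp is running as llama-server, which exposes an OpenAI-compatible chat endpoint. The port, model name, and prompt are just what I’d probably start with, not gospel:

```python
import requests  # pip install requests

transcript = "...full transcript text from the speech-to-text step..."

# llama-server (from llama.cpp) speaks the OpenAI chat API; 8080 is its
# default port, but treat the URL, model name, and prompt as placeholders.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "granite-3.2-8b-instruct",  # whatever GGUF the server loaded
        "messages": [
            {"role": "system", "content": "Summarize the transcript and list action items."},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```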
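And the vector side, kept stupidly simple: sentence-transformers plus plain cosine similarity, no pinecone, no openAI calls. A real setup would sit behind an actual vector store, but the retrieval idea is the same:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in "conversation memory": one summary string per conversation.
summaries = [
    "Team meeting: ship the login fix by Friday, Alex owns the release notes.",
    "Rant to myself about replacing Omi's cloud backend with a local stack.",
    "Reminder: refill prescription and email the landlord about the heater.",
]
# normalize_embeddings=True means dot product == cosine similarity.
emb = model.encode(summaries, normalize_embeddings=True)

query = "what did I say about running everything locally?"
q = model.encode([query], normalize_embeddings=True)[0]

scores = emb @ q  # cosine similarity against every stored summary
best = int(np.argmax(scores))
print(f"best match ({scores[best]:.2f}): {summaries[best]}")
```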
Omi already provides the entire codebase, from the firmware on the device to the app code to the backend. Their documentation covers everything, including the process for setting up the backend locally.
(Oh, there goes copilot again. Mute.)
Once the LLM parts are set up, all I get to do is change the endpoint URI in the phone app and implement the callhome functions from my models to my backend. Theoretically simple; we’ll see how simple it actually is as I build this out.
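I haven’t dug far enough into Omi’s backend docs to know the exact payload shapes yet, so here’s a purely hypothetical stub of what my end might look like: a FastAPI route that accepts whatever the app posts and hands it off to the local pipeline. The route name and payload handling are made up; the real contract comes from Omi’s documentation.

```python
from fastapi import FastAPI, Request  # pip install fastapi uvicorn

app = FastAPI()

@app.post("/transcript-segments")  # hypothetical route; Omi's docs define the real one
async def receive_segments(request: Request):
    payload = await request.json()
    # TODO: feed the payload into the local transcription/summarization/embedding steps above.
    print("received a", type(payload).__name__, "payload")
    return {"status": "ok"}

# Run with: uvicorn backend_stub:app --host 0.0.0.0 --port 8000
```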
Original Rant Transcript
Transcript and Summary of my original rant into the Omi Device: https://h.omi.me/conversations/47e45366-c240-4a25-b649-24fdc6d94f40
I don’t promise it’ll stay up. If I delete it I’ll try to update this blogpost with a local copy of the transcript and summary.
Final Thoughts
I don’t want to be limited to 1200 minutes unless I pay more. I want to be able to run everything without limits on my own devices. And the entire infrastructure of this device has been set up to do this.
Omi impresses me because every part is open source. And I don’t need to modify the firmware on the device. The device, from my understanding, records and sends the recording over to the app. The app streams the audio to their backend, which sends it to an external AI provider for transcription, and then sends the transcript to another AI provider for summarization. But all of this can be replaced with my own stuff.
And I’m probably going to play around with this in the next couple of months. I hold no commitments on what I’m going to do next. But I wanna play around with this. I wanna see how this goes. I wanna see what success we have here. Because this is the most promising open source AI thing I’ve seen so far.
They have made everything available to the consumer, to tweak and create and mess with.
I’m normally not a fiddler - I just want things to work. I didn’t even touch the concept of a homelab until my friends helped me (shoutout J, Asher, and Trevor) figure out what to buy and what to run. This is an interesting journey because it’ll be the first time I invest time into building out an entire AI stack and infrastructure. And I have tailscale set up now so I can run this from anywhere!
So yeah, that’s what I’m gonna do.