Speak: The first unicorn of AI language learning, ARR $50 million, over 10 million users

The first unicorn of AI language learning is here.

Recently, the spoken language learning application Speak announced that it has completed a Series C financing of US$78 million, with a valuation of US$1 billion.

As recently as June this year, Speak completed a B-3 round of financing of US$20 million at a valuation of US$500 million. In just half a year, Speak’s valuation has doubled.

According to information from the Investment Practice Institute, Speak’s ARR is currently close to 50 million US dollars, with an annual growth rate of 100%.

Speak was founded in South Korea in 2017. Before 2023, the Korean market accounted for more than 90% of its revenue. PMF has been verified in the Korean market: nearly 6% of Koreans are using Speak to learn English. This is also an important reason why Speak was able to raise its earlier rounds. After the B-1 round of financing at the end of 2022, Speak began to explore the Japanese market. In addition to its two major established markets, Japan and South Korea, Speak’s revenue in the Taiwan market began to grow rapidly in the second half of this year.

Recently, Redpoint AI podcast hosts Jacob Effron and Patrick Chase interviewed Connor Zwick, the founder of Speak, for an in-depth discussion of Speak’s business model, product interaction design, and how it uses AI. The podcast was compiled by Founder Park.

Some points worth noting:

With the support of AI capabilities, Speak gives everyone a 1-on-1 personalized language tutor at a very low cost, something that previously required a very high cost.

The core of AI tutoring is personalization: personalization of courses, of learning modes, and of tutors. This is completely different from the traditional model of manually designing one course for everyone to use. Not only is the course content different, but the cost structure behind it is completely different as well.

In Connor’s view, any form of tooltip, user education, or feature explanation means that the design is not yet complete enough for users to use it intuitively.

Model evaluation in vertical scenarios is difficult. If you can distill a perfect evaluation criterion, you have basically distilled the problem you want to optimize, and optimization becomes straightforward.

At its core, Duolingo is about providing an informal learning experience, while Speak is focused on helping people who have spent more than 10 years learning English and are trying to improve their fluency, but lack the opportunity to communicate with real people.

What Speak provides users, and what they fundamentally pursue, is connection between people. They seek to connect with more people globally, whether professionally, culturally, or socially. They want to connect with people beyond those who speak their own language.

Two thousand years ago the best way to learn was the way Alexander the Great learned from Aristotle, and this is still the best way to learn now. This will obviously change with the development of AI and education.

01 The core function of Speak is to make people speak foreign languages fluently

Patrick: Can you briefly introduce your product?

Connor: Of course. Speak is a solution that helps people speak a foreign language fluently, and specifically learn how to communicate in that language with other people in real life. I would contrast this with other learning methods that may focus more on grammar or memorization, or on flash cards. In fact, we have also seen many language learning applications based on flash cards. So that's our core philosophy in a nutshell.

We have a complete teaching methodology for the actual learning process. Of course, technology is also incorporated into it. But fundamentally, we teach people word combinations from the beginning and have them learn the high-frequency words that often appear together in everyday conversation. Next, we teach you how to say these words, and then let you practice until you have mastered them. Once you have mastered these vocabulary patterns and combinations, we have you practice them in simulated conversations built around your real-world goals for learning the language. So, if I want to learn Spanish so I can go to Mexico City and hang out with friends, then I might practice something related to that, and I can really apply that particular language.

And all of this is extremely personalized, for each user. So no matter what people's motivations are, what their interests are, what their level is, everything is part of the course and everything is in service of achieving the user's personal goals.

Jacob: What I think is cool about your product is that in addition to learning the basics like vocabulary, you also help people improve the way they speak. I think that includes accent issues, which can help people express themselves more clearly. How did you build this feature?

Connor: Speak has different modes to choose from and can be very specialized. At least in the short to medium term, we can continue to do modeling ourselves, such as developing our own internal models for certain tasks. For example, we can do something more specialized and niche than a general model.

In the long term, large base models will continue to replace a lot of things. But in the short term, as a concrete example, we've developed our own internal speech recognition system that works really well for people with strong accents. It understands what they're trying to say and the specific errors they're making, and it does all of this quickly and reliably, so that the information is fed back to our customers in a way that enhances our product experience.

We also have a speech recognition system built on all the data, capable of detecting mistakes learners make in pronunciation and other personalization areas. There may be a day when a really good multimodal speech model can do this, but it's not going to happen tomorrow. So, even though we'll only be using these models for a few years, they'll still be hugely valuable to growing our business. This isn't necessarily our long-term core expertise or strategic pivot, but it's certainly going to be very valuable, at least over the next few years.

Jacob: Because when you build these models, you first get something that you use now, and then in the process of developing other things around it, you gain new knowledge.

Connor: That's true. Building a larger business with more users allows us to collect more data, build other models on top of it, and have more resources to make bigger investments and do all these other things. I think at some point it's really like getting the ball down the court, and that's the first thing you focus on. You can't think too abstractly.

02 How should language learning apps design interactions?

Jacob: You are currently trying a different way of learning languages. I feel like we've all had this experience in the past, like getting a tutor or reading aloud, but it's not what people expect when they first open the app. So, is product user education really difficult? Or will it be intuitive for people to use from the start? What have you learned over the past few years?

Connor: One of the things I always hold on to when working with our product and design teams is that any form of tooltip, user education, or feature explanation means our design isn’t complete enough, and users can't yet use it intuitively. But it is really challenging. When we hire designers, we talk a lot about the fact that we are exploring new interface paradigms around audio-centric experiences. People are communicating with technology in a way that is fundamentally unfamiliar to them; they have never interacted with technology like this before, and the open-ended nature of it adds to that unfamiliarity.

For example, in the new user onboarding flow, when you open the app, only a microphone button appears. We ask you a simple question: "Why do you want to learn English?" You just press the button and start talking. But people might be thinking: what should I say? What tone should I use? Should I give a minute-long answer or just a few words? Users were surprised; not completely shocked, but it was certainly not what they expected. This is a great example of where an experience like this can be made even better, because we want them to express their motivations in their own words and then, once they provide that information, give them a highly personalized experience.

Of course we want to show users how to use the app from the beginning. But at the same time we also know that this is completely new territory. So how do you design a way that’s both intuitive and future-proof? This is a typical challenge.

But I think minimizing user education is a goal, and this feature is a great example. We probably launched the first version a few quarters ago. Since apps like ChatGPT have become so popular around the world, users’ understanding of this conversational paradigm has changed significantly. So I do think all of this is evolving rapidly, and it's another design challenge worth thinking about.

Patrick: How do you see UI evolving over time? For example, will it gradually transition to audio and eventually become an intelligent agent that can talk to us? Or do you think the current UI will always have its unique value? Maybe we select what we want from a menu, then make the transition and have a structured conversation around that.

Connor: My guess is that the future UI will become more fluid and natural, will accommodate both, and will be more intuitive to use. This is a question we often think about, and we use the word "hybrid" to describe it. But how do you build a hybrid interface that lets you choose to speak or type at any time?

We are in the early stages of the development of this paradigm. While voice isn't always the best option, it does make it more convenient in some situations. As speech models continue to advance, this will be a huge shift. However, there are situations where we may be more inclined to type or click, especially when we have a keyboard, which is actually a faster way to input. So that's my guess and that's where we're thinking.

Patrick: The progress in the voice field has been huge. Now, we can interrupt what the model is saying and they sound more natural and vivid. Over the past few decades, we have been in a phase of adapting to technology. And now, I think technology is going to adapt to the way we interact, which is really cool. The field of speech is undoubtedly one of the most interesting fields.

Connor: I completely agree with the above point of view. Another related question is whether you push information to users or attract them through in-depth understanding of users. In the past, if we wanted to obtain valuable information, we had to actively search for it ourselves, and 99% of the information that was pushed over was worthless, such as spam emails and push notifications. They simply send messages.

And now a new kind of interface has been unlocked: one that thinks about our needs while staying in the background. It may observe our information or data and handle tasks for us in the background. This is an important question that we have been thinking about in the field of speech. How it will ultimately develop remains to be seen, but what we can foresee is that the interface of the future will change dramatically.

Patrick: What does this mean in the real world? For example, if you are traveling in Paris, your device will say: "Hey, have you thought about ordering food from this cafe?"

Connor: It can certainly do some proactive things like this. But a lot of the things we're thinking about here actually require quite a bit of computing power and involve latency, so they're not necessarily something we can present to users right away. For example, say a user in Korea uses Speak for an hour. At night, when that user group is asleep, we can run GPUs to analyze what those users did that day, work out which key points and courses they will be interested in, and have the next lesson ready for them tomorrow.

Jacob: Users will receive these as soon as they open the app the next day.

Connor: That's right. It's about doing these things in advance, and the more information the system has about your users, even when they're not using it, the more we can do. It's like a completely new area. I don't think many people are involved in it yet, but it will be a profound transformation.

03 Charge as many people as possible, or replace offline tutoring classes

Jacob: Are your course settings planned in advance? Or will you change the course setting as your study progresses? Will you stick to a certain guideline and set a learning path in advance? Or are there different paths that can be adapted to each situation?

Connor: When considering the curriculum, I think there is no contradiction between the two methods, so they can be combined.

Take learning a language as an example. There is a certain correct order for learning a language. For example, you need to start with the most basic vocabulary, because the 100 most common words are used 20% of the time, and the next 500 words are used 80% of the time. So you should learn a specific set of words first. However, even in the initial stages of language learning, the specific order of these words can be personalized and customized for each user. The first 500 words you should learn may be completely different from another person's, so we stay flexible.

But for a long time, humans will need to be involved in setting the curriculum to make it better fit learners' needs and to control the details, especially at the level of overall learning strategy and methodology. That said, more and more of this work is now actually done by teams outside our curriculum team, such as the machine learning team. It's an interesting challenge because the machine learning team now needs to understand the principles of our learning methodologies and how we teach people language, which is very cross-functional.

So, back to your question, I think the ideal situation would be both. The coolest thing is when users deviate slightly from the usual path and start learning in an unconventional, personalized way.
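Connor's point about word frequency (a small set of common words accounts for a large share of everyday usage) can be checked on any text sample. The following is a minimal sketch, not Speak's method: the function name `coverage_of_top_words` and the sample sentence are hypothetical, and a real analysis would use a large conversational corpus rather than one sentence.

```python
from collections import Counter


def coverage_of_top_words(tokens, top_n):
    """Return the fraction of all tokens covered by the top_n most frequent words.

    `tokens` is a list of already-tokenized, lowercased words (an assumption
    made for this sketch; real text needs proper tokenization).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Hypothetical toy sample; 14 tokens, 9 distinct words.
sample = "i want to go to the cafe and i want to order the coffee".split()

for n in (1, 4, 9):
    print(n, round(coverage_of_top_words(sample, n), 3))
```

Even in this toy sample, the four most frequent words account for 9 of the 14 tokens, which is the shape of the argument for teaching high-frequency words first while leaving room to personalize the rest.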

Jacob: Yes, I don’t know if your AI has ever made an unexpected move, like AlphaGo did when playing Go, that surprised the user but turned out to work surprisingly well.

Connor: I don’t know if you have read the science fiction writer Neal Stephenson. He wrote a book called "The Diamond Age", which is basically about this. The book is about a girl who finds a book that is basically AI-driven, an all-encompassing primer that can teach her anything. The whole story revolves around this. It actually inspired me a lot, especially the way it describes how the book provides comprehensive guidance to an individual in a unique and creative way. This is something we discuss internally a lot, and I think it's a really cool idea.

Jacob: Now what I'm curious about is that the cost of model calls seems to be going down every few months. Therefore, I am always curious whether you feel there are any limitations in product development, such as some things you want to do, but cannot be realized because the cost is too high. Or do you feel that model inference cost is simply not that important?

Connor: I don’t think we are too limited by this. Obviously, compared to enterprise services, the cost per user is lower for a larger company like ours. We don't have free users, and we don't feel constrained when it comes to serving our subscribers.

But even if we do feel constrained, we will probably develop anyway and bear those costs in the short term because we believe the costs will come down over time. This does feel like OpenAI's strategy, the lower the cost, the greater the demand. If you lower costs, demand increases one for one and these companies can make more money.

Patrick: How do you think about pricing your products?

Connor: Pricing is really important. I feel like we haven't had enough time to think about this in depth, at least not as much as I'd like. But in general, we are considering two extreme cases.

On the one hand, how do we make Speak attractive to anyone who wants it? Because at the end of the day, we're delivering a software solution to a problem that traditionally couldn't be solved at such a low marginal cost. So we have an opportunity to provide value to millions of people. I think the way to grow the business is to reach all of these people with a product and charge them over time. This is just one aspect.

On the other hand, I think there's actually a very interesting opportunity here to charge more for a consumer product because of what we're really offering. There are millions of consumers right now who pay hundreds of dollars a month for something like in-person tutoring or after-school classes. So the question is, is it possible to also offer this high-end experience and charge for it? It doesn’t have to be as expensive as extracurricular classes, but it must be worth every penny.

If we can offer something different and truly valuable, we won’t get into a price war. But this is still in the very early stages. And pricing is often counterintuitive and always changing.

Jacob: What about model evaluation? For example, when a new model comes out, I obviously have a high-level goal of making it easier to learn a language, but how do you know if the new model you're testing is really good?

Connor: I think people tend to underestimate how difficult and how important evaluation is. With our machine learning team, I often say that evaluation is perhaps the most important thing. If you can distill a perfect evaluation criterion, especially for the open-ended tasks that large language models often perform, but also in speech, you have basically distilled the problem you want to optimize, and optimization becomes straightforward.

To give a specific example, in terms of speech, let's not consider those vague tasks. For speech, it’s not just about how many word errors we have and how many words we mislabel, but more importantly whether we capture every mistake of the user. Sometimes a user says a word that others simply cannot understand, and we can now train a model to understand words that humans cannot understand in communication. So how do we evaluate this? What is correct and what is incorrect? Our assessment can become very complex.
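To make the distinction above concrete, here is a minimal sketch contrasting the standard word error rate with a separate score for how many learner mistakes were caught. It is an illustration under stated assumptions, not Speak's evaluation: the function names, and the idea of representing human-labeled mistakes as word positions, are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mistake_detection_scores(labeled_mistakes, detected_mistakes):
    """Precision and recall of detected learner mistakes.

    Both arguments are collections of word positions (hypothetical labeling
    scheme): positions a human marked as mistakes, and positions the model
    flagged.
    """
    labeled = set(labeled_mistakes)
    detected = set(detected_mistakes)
    true_positives = len(labeled & detected)
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(labeled) if labeled else 1.0
    return precision, recall


print(word_error_rate("i want to learn english", "i want learn english"))
print(mistake_detection_scores({1, 4, 7}, {1, 4, 9}))
```

A recognizer can score well on the first metric while doing poorly on the second, for instance if it silently "corrects" what the learner said, which is one reason a single number rarely captures what a product like this needs to measure.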

Patrick: Yes, I've always been curious about evaluation. So, what will happen within the Speak team when GPT-4 is released? Do you run all the evaluations on it and then decide, "Okay, we're going to release this"? How did you do it? When you see the good response to GPT, do you say, "We heard it's great, we're going to open it to users today"?

Connor: No, we have a complete process. Fortunately, we generally have a close relationship with OpenAI, and we can usually get a good feel for their models very quickly. For example, we now have a lot of internal tools and technologies. We have 40 different major tasks, and we have different evaluation loops for each of those tasks, including human evaluation, all of which are basically distilled into a playbook. This is necessary; otherwise, every time a change occurs, it causes a lot of chaos within the organization. So that's something we've gotten better at over the past year and really benefited from.

Jacob: Someone on the podcast recently suggested that instead of doing the evaluation yourself and building a perfect test data set, you release the change to some of your customers first and track the product metrics you care about. Your customers will tell you quickly if it works.

Connor: Yes, I forgot to mention it just now. It's absolutely important to run experiments and track whether the metrics we care about, as well as the guardrail metrics, are all behaving properly. This is also a very important step.

04 Speak and Duolingo solve completely different problems

Jacob: Many people would say that generative AI is a very cool technology, but it mainly benefits existing leading companies. And in your space, Duolingo is incorporating a lot of these new things now. So I'm curious, what are your thoughts on this in general and in the specific context of AI technology.

Connor: Broadly speaking, artificial intelligence has indeed helped existing leading companies maintain their position. But it depends on the problem you were solving better than others before and on how AI changes it. Take customer service as an example: if the problem you solve is making the management of a group of customer service staff more efficient, and you have very good software to manage and evaluate these people, and now there are LLMs that offer fully automated solutions, I don’t think AI is actually going to be that helpful to you. It might help you better evaluate each agent, but the change here isn't about making agents more efficient. If the entire process can be completely automated, that is extremely disruptive.

For example, I think when talking about language learning and Duolingo, we are fundamentally solving very different problems.

The vast majority of Duolingo’s subscribers are native English speakers primarily from the United States, United Kingdom, and Australia. One of the most shocking things I’ve heard is that most Duolingo subscribers have never learned a language before. And they're now using Duolingo to learn, which is really cool. Duolingo allows these people to start learning languages who might not otherwise learn them.

Jacob: They must have seen someone else using Duolingo on TikTok and started using it.

Connor: TikTok is definitely responsible for a lot of this. The point is they've created a really good product that's like a casual, almost brain-training program that makes you feel like you're doing something meaningful rather than hanging out on Instagram or whatever. I think this is an amazing achievement. Of course they also care about whether the user speaks fluently. But at their core they are built to provide an informal learning experience, and I’m not sure AI can necessarily help you build a better informal learning experience. Maybe, maybe not.

But I think for us, our customer base is actually very different, our users generally don't use Duolingo. Our specialty is getting people who don't speak English to learn English. We are very focused on helping people who have previously spent over 10 years learning English, trying to improve their fluency, but lacking opportunities to interact with real people. So people use Speak for completely different purposes than Duolingo.

I think in this case, artificial intelligence is obviously very helpful for our use case. This once again demonstrates the importance of understanding your actual core product market fit and what problem you are solving for which users. So I like to distinguish between sustaining and disruptive changes brought about by technology.

Jacob: If we have real-time translation tools and the ability to change accents, does that mean some of your users don't need to learn English?

Connor: Maybe some users will be like this. But I think there might be some other factors here.

First of all, think about what the best translation in the world looks like. When two world leaders talk, if you look at the unedited version of their actual exchange, you'll see that there's a huge delay between when one person speaks and when the other person responds. That's because languages are fundamentally different, right? For example, the order of words is different. Therefore, there are inherent delays and imperfections in translation.

But I think what’s really important is this: if I distill what we offer, what we provide to users and what they fundamentally pursue is connection between people. They seek to connect with more people globally, whether professionally, culturally, or socially. They want to connect with people who don’t just speak their own language. I think that for most people, the reasons behind the time they spend learning a language are fundamental and won't change. Even the best AI real-time translation, like the one from Babelfish, can’t solve this problem. I think it would be really cool to solve the communication problems that tourists have when they randomly go to a country. But for our user base, those who fundamentally pursue speaking fluency, their reasons for learning are very important.

Jacob: I remember when GPT-4o was released, Duolingo’s stock fell immediately. I don’t know if this is because the hype for AI is too much, or if this is rational.

Connor: Well, I don't know. The market feels very noisy right now, so I'm not really sure, and I don’t have a particularly strong opinion on this. But GPT-4o’s voice-to-voice capabilities have us really excited. I think Duolingo's stock price dropped probably on the assumption that people are now using ChatGPT to learn languages. But for us, we're very excited about speech models that integrate multiple modalities. After all, our app is called Speak.

Jacob: But at least speak.com will be a very valuable domain name.

Connor: Yeah, maybe this will end up being all we have left. But I think the broader point is that more people will actually use ChatGPT. Many people will start using ChatGPT to learn the language and practice it. This would be a very interesting tool. But I think it's fundamentally a good thing for people to use ChatGPT and realize that they can use AI to learn languages.

Then, if they really want to learn in depth, they will look for more professional solutions and more effective learning methods. Because at the end of the day, if you’re willing to spend the time and money to learn a language, it’s not something you can do in 10 hours. This is a habit that takes months, even years. You'll want to find the most efficient solution.

So there is a market space for us to specialize and build more effective learning solutions. We are focused on capturing this segment just like Airbnb captured home sharing and Uber captured ride sharing. We see this as a similar opportunity. And in many ways, this will increase the number of people using AI to learn languages, which is pretty substantial.

Jacob: That way, they'll be less confused when they see your first prompt.

Connor: They will understand these paradigms for interacting with AI.

Jacob: With the release of GPT-4o, people have been asking whether audio-only models still have a place, or whether multimodal models can do all of this while also having the most sophisticated reasoning engine. It seems like there's definitely room at the edges, at least in voice cloning and other areas where OpenAI won't be involved. But do you think these audio-only models still have a chance to win?

Connor: I bet there is. I'm not super confident in that either, but I would say even for Speak, we're building our own speech recognition technology because we have a specific use case. We believe this use case will not be fully solved by these large audio models in the near future.

So I think that's going to happen in situations where you need specific security for your voice data, or you need to deploy locally, or you need a very specific vocabulary that doesn't show up in a general domain like the Internet. So I think there's still opportunity here. And on top of that, startups have the advantage of being willing to take more risks than larger companies.

05 The goal of AI is to completely replace the human role in the learning process

Patrick: I'm curious how much you invest in these models. How big is the investment in computing power, team, or resources?

Connor: It's definitely a very large investment, but it's just one of many investments we make. It's difficult to build a model solely for a specific task because you need data and expertise to do it.

At the same time, as you mentioned before, people often ask under what circumstances we can just use existing large models and base models instead of building a model ourselves.

A good analogy here is that in the 1980s, there were 10 or 12 different personal computers from different companies, including companies like IBM or Atari. Apple is a good example. Instead of using their own processors, they used Intel processors. There are good reasons for this. There are dedicated companies that make processors, and they spend huge amounts of money building these things better than anyone in the world. That's not to say, though, that Apple wasn't building something truly valuable on a technical level. They were building not just MacOS, but all the firmware that sits directly on top of the processor. We're in the same situation today.

Similarly, today’s equivalent of firmware for AI is the machine learning framework. What we are currently doing is making these different models cooperate with each other to power different tasks, along with the backend, the product, and so on, and making all of it perform very well, which is technically not easy. So this part of the technology is also quite complex and deep. For me, when talking about investment, modeling is just one part, and the things I just described are the bigger investments. If I were to say what our long-term technology model is, I think those are bigger investments than modeling.

Jacob: As we get better and better models, what improvements can you make in the product?

Connor: We’ve known from the beginning that the technology still has a long way to go, and we can’t perfectly predict the future. But what we're betting on is that over the next five to 10 years, as we get more data and more computing power, the models will get better and better, and eventually they'll surpass humans on a variety of tasks. Ultimately, this means we can completely replace humans in the learning process.

We always use this as our guiding direction and clear goal. None of the product decisions we make are driven by short-term optimization; they are aligned with our long-term vision. So we always think like this: think about what we can do today, think about what we hope to be able to do tomorrow, and make sure we are developing in that direction. We think of it as a series of steps, like a flight of stairs. Every year or two we move up a step and the product evolves, but always along a consistent and coherent vision.

I think that's why we've been able to lead in AI-based learning. Even in the early days, the capabilities the product unlocked were built around very accurate speech recognition, allowing people to speak into the app and have a great learning experience, with the app understanding exactly what people want to say. Then add speech recognition, then add basic language understanding, and keep growing from there.

Jacob: A question that many founders are thinking about now is: to what extent do you build around the shortcomings of today's models, and to what extent do you do nothing and wait? After a major model upgrade in the next year or two, some problems will simply be solved, so there may be no need to invest so much in edge features and speech recognition, because AI will have made great progress in speech recognition.

So I would like to know, how do you think about these issues now? Obviously, you can now solve real problems for users, but at the same time, you can also get a lot of benefits for free from the improvement of the underlying model. How do you think about this problem?

Connor: Yeah, that's really the issue right now. I think for a lot of companies, if you're building on top of these technologies, you really need to have a deep technical intuition about how they work today and how they're going to work in the future. For example, the timeline of technology advancements, that's something we've been thinking about at a strategic level to understand where investments should be made.

However, I also think that an extremely important part of business is being able to better understand and articulate what problems you are solving for people. Even if the technology is not mature today, it is still very worthwhile to understand and solve this problem, even if you may need to replace the entire technology system in a few years.

I think Speak has a huge head start, even though there are a lot of things that it can't do right now that it hopes to do in the future. We’ve been working on learning methodologies and we’ve been thinking about things like how to engage users and keep them motivated. So, I think certain technologies do need to find a balance. But overall, I think technology is only going to get better and it's important to make sure you're building something of value outside of the core technology.

Patrick: What is the large-scale infrastructure you are building in the machine learning framework? Is it for evaluation, for linking different models together, or for inference?

Connor: All of these, and more. The current trend is that many companies will provide services in these areas, but what we need to build is often very specialized, unique, and a bit hardcore, so off-the-shelf tools cannot be used. So we have to build it ourselves.

What we focus on is how to make these models perform better on a single task, and then how to coordinate them. How do you keep collecting new data? When do you fine-tune? How do you evaluate? More generally, you start thinking about how to build large-scale infrastructure around these models, and how to invest in building true representations of language that can be extracted and built on: not knowledge graphs, but an understanding of when a user is speaking fluently and when they are not. This is just one example, but at least 50% of the time we spend on product development now goes to things related to these systems. So it's definitely a huge investment for us.

Jacob: What is the most painful thing for you? I think when you're working on early-stage technology, you think 10 years from now, when you tell other engineers, "I've had to do this before," they'll laugh, and I'm sure there will be a lot of that. But what’s hurting you the most right now?

Connor: From a technical perspective?

Jacob: Everything that you need to do to make the project work.

Connor: I do think some of the practices are quite stupid. For example, we are still optimizing prompts. I think this is just a temporary phase, and in 10 years people may be thinking: why did anyone do this? Was it necessary?

06 No plans to develop foundation models in-house, the cost is too high

Patrick: In terms of audio models, you're obviously building a lot of features internally, but are you waiting for certain features to come out and say, "If this comes out, then we can try more new things?" Or are you waiting for some specific functionality to appear in the model layer?

Connor: Certainly. I think these general cognitive models, these LLMs, are still at a very early stage. I would even put multimodal audio connected to an LLM in this category. As mentioned previously, we believe these capabilities should not be built in-house, because the cost to train and develop these models would be hundreds of millions, if not billions, of dollars. That is the most critical point. Our main focus is multimodal audio. For us, this is truly the Holy Grail. It's going to take some time to make it really great, and there are a lot of possibilities to build specialized things on top of it. But as it continues to get better, I think the possibilities are endless.

Jacob: So what specific technologies are you waiting for to emerge? Something like GPT-6? By then it might be possible to build voice agents with accents.

Connor: Yeah, there are definitely a lot of improvements coming on the audio side: better multimodal capabilities, better multilingual capabilities, and the ability to produce something that feels closer to a real teacher. That's not always necessary, but sometimes it's very useful. It means a more natural, lower-latency technology.

Because instead of running a speech recognition model first, then an LLM, and then speech synthesis, you now have one continuous model that can do all of these tasks in a single operation. And it doesn't cause a huge loss of information, because it doesn't reduce the complexity of your speech to a small piece of text, feed that into an LLM, and then try to expand it back into synthetic speech with the right nuance and intonation.

Not to mention it provides a deeper understanding of what you said, how you said it, your confidence, your emotions, and any mistakes you made. So this is a very important technology that is coming to the fore. We are still in the early stages and have a lot to prove. But as far as the cognitive model is concerned, I think there is obviously still a lot of room for improvement in terms of reasoning and general abilities, such as being able to complete a task from beginning to end and ensuring quality and reliability. As for how these will be implemented, it has not yet been determined.
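The information loss Connor describes can be illustrated with a small toy sketch. Everything in it is hypothetical and chosen only for illustration: the `Utterance` fields, the function names, and the 800 ms threshold are assumptions, no real speech model is called, and this is not Speak's implementation. It only shows why a cascade that passes plain text between stages cannot react to how something was said, while a single model that receives the whole utterance can.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """A learner's spoken input, reduced to a few toy features (all hypothetical)."""
    words: str
    hesitation_ms: int        # pause before speaking
    rising_intonation: bool   # unsure, question-like delivery


def transcribe(u: Utterance) -> str:
    """Stand-in for a speech recognition step: only the words survive."""
    return u.words


def cascaded_feedback(u: Utterance) -> str:
    """Speech recognition, then a text-only stage. Delivery cues are dropped at step one."""
    text = transcribe(u)
    # The text-only stage can judge wording, but has no access to how it was said.
    return f"You said: '{text}'."


def end_to_end_feedback(u: Utterance) -> str:
    """Stand-in for a single audio-native model that sees the whole utterance."""
    reply = f"You said: '{u.words}'."
    # Illustrative threshold only; a real model would learn such cues, not hard-code them.
    if u.hesitation_ms > 800 or u.rising_intonation:
        reply += " You sounded unsure; let's practice that phrase again."
    return reply


confident = Utterance("I would like a coffee", hesitation_ms=100, rising_intonation=False)
hesitant = Utterance("I would like a coffee", hesitation_ms=1500, rising_intonation=True)

# The cascade gives both learners the same reply; the end-to-end stand-in does not.
print(cascaded_feedback(confident) == cascaded_feedback(hesitant))      # True
print(end_to_end_feedback(confident) == end_to_end_feedback(hesitant))  # False
```

In this toy, both learners say the same words, so the cascade treats them identically. Only the stand-in that still has access to hesitation and intonation can offer the hesitant learner extra practice.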

Jacob: If you had such technology, what would you use it for?

Connor: I think it means being smarter about lesson planning. The biggest missing piece for us right now is how to do that well and reliably.

Jacob: Currently this still requires some kind of human intervention.

Connor: Yes. I think we're much less constrained by reasoning than most use cases. One of the really exciting and special things about the language learning field is that we can use current technology to build something very practical and disruptive without requiring a lot of reasoning power. We can basically eliminate human intervention altogether and build something cool and useful. Many industries still find themselves in need of humans because technology alone is not good enough.

So, in any slightly higher-stakes field outside of language learning, you need higher levels of reasoning and consistency, and lower rates of hallucinations. I think that's a particularly noteworthy aspect of our industry, and it's an advantage we have over other areas.

Jacob: When you look at those OpenAI demonstrations, such as letting AI answer math problems, such as sine and cosine functions, do you think, will we do something like that in the future? Or are those areas outside your focus? Will you create a sci-fi-like learning platform where people can learn everything in one place?

Connor: I think there's a big opportunity there. We'll see where we can be useful enough, but I think there are many other opportunities beyond language learning. There will be many other companies entering this track.

There are many other complexities in these areas, but fundamentally we focus on three areas. The first is school. Obviously, people spend a lot of time studying in school. The second is what I mentioned earlier, enterprise and professional skills, which is a huge opportunity to provide certification, assessment, and skills development for businesses. The third is personal learning. I think personal learning is what people are overlooking right now, but it's a huge segment.

Jacob: I think most people are using tools like ChatGPT.

Connor: Many people are using it. I think personal learning will be one of the biggest areas of change in human activity. People don’t realize how vast this field is, but a lot of the things we do every day, like reading books, listening to podcasts, watching videos on YouTube, all of these can be classified as learning or related to learning. Behind this is the desire to know more information, and people are trying to become a better version of themselves.

In the early days of the Internet, some predicted that people would waste a lot of time online. But it turns out that the advent of YouTube or search engines changed the way people get information, but in the early days, no one really realized what a search engine was and what it might look like in 2020 or 2025. I think the same goes for personal learning. So just on the consumer side, it is possible to build a more professional and advanced information acquisition platform. But there will be many different companies entering this track, and the competition will be fierce. When the time comes, we will know who has the right entry angle.

07 The language learning model has not changed in 2,000 years, AI may change it

Jacob: How do you personally think people will learn in the next 10 to 15 years?

Connor: I think it’s really a highly personal thing. Just like a scene from the movie Her, the AI has long-term memory and a complete mental map of everything. It knows your interests, personality and what you want to know, and then it uses that information to give you the right information when you need it.

I think there are different levels of this. Platforms like Google or YouTube are very widely used and very powerful. There will also be more informal channels for getting information, like chat platforms, maybe ChatGPT, maybe something else. But I do think there will be more professional solutions as well. I think the question is, how do you monetize it? What are the usage patterns? Will there be more niche things, or will it become more specialized?

Jacob: Will these AIs share memories and knowledge about you? Do they know everything?

Connor: Yes, you only need to look at Web 2.0 to know this.

Jacob: I used to think that encryption technology could solve the problem of privacy.

Connor: We'll see. But maybe it will end up being something more like email. I have no idea. But broadly speaking, no one denies that AI will change everything. People are just too optimistic about how quickly things are changing.

Patrick: Indeed.

Jacob: Technology entrepreneurs just need to be overly optimistic.

Connor: Yeah, you need belief and excitement. I think there will be major disruptive changes in the field of education.

If we zoom in and look at it, just like someone said before that "software eats the world", software has indeed almost occupied the world in the past few decades, but what about education? Even though we now have Chromebooks in every classroom, has the quality of education changed? Basically, people are still taking quizzes, but they're doing it on a laptop instead of on paper. They are still learning. They might be watching a class live-streamed, with an instructor teaching to a million people.

Jacob: They will use electronic flash cards instead of paper flash cards.

Connor: They use electronic flash cards. But the quality and the efficiency of all of that, I really don't think much has changed. Two thousand years ago the best way to learn was the way Alexander the Great learned from Aristotle, and that is still the best way to learn. This will obviously change with developments in AI and education. Learning is one of the most important human behaviors. The field of education will undergo huge changes, which are currently difficult to predict and difficult to grasp in their entirety.

Patrick: How soon do you think this change will happen? Five years, ten years, or sooner?

Connor: There is a saying that nothing big will happen in a few years, but a lot more will happen in a decade than you expect. I think that's usually the case.

People will hype everything now. There will likely be significant deployment of new technologies. But in the long term, I think a lot of things will change. But what I'm also concerned about is that on the research side, on the technology side, people are very excited about Transformer. I hope this doesn't make us too obsessed, there are many other things to look into. Where Transformer will take us is still unknown.

Jacob: In terms of AI education, what other areas are you focusing on? Is there progress in math and other subjects, or what other areas are you considering? Language learning has already achieved results. I'm curious to see if there are real breakthroughs in other areas as well.

Connor: I think the advantage will be smaller in other subjects until AI performance improves significantly. The thing about language learning is that it really does require a one-on-one teaching model; a classroom of 30 students doesn't work that well.

Jacob: I have personal experience with this.

Connor: Yes. So AI needs to significantly improve its performance, which will take a long time. I'm not sure the current technology is up to the task for other subjects. We may need more capabilities in certain subjects, such as reasoning in mathematics. I'm not entirely sure yet. But I think the bar for creating something people will actually enjoy using is much higher there. In contrast, using AI to learn languages is already significantly more effective than other methods.

So it’s really about timing. For language learning, even if AI doesn’t advance further in the future, we can still create many other things based on existing technology and build better experiences. But it takes time.

Patrick: I don't know if you saw the GPT-4o math demo. It was amazing: they live-streamed it solving math problems, and people were talking about it. But your point is that there's still a gap between live human teaching and AI teaching, even when the AI can see what you're doing and provide guidance in real time.

Connor: Yes, there are probably millions of people using ChatGPT to help with their homework. So I'm not sure I'd go out and develop a homework aid right now. But if we're trying to fundamentally improve the way math is taught, beyond just helping with homework, that requires a deeper solution. The constraint here may not be technical; it depends more on whether you can actually create a qualitative leap in the tool so that people are willing to adopt it.

And people don't just learn mathematics casually in their spare time. So the bar is even higher, because you need to sell your math product to schools. Maybe you could also market tutoring classes or other products to parents. But I think a lot of the time, developing a really good product and finding the right market is the hardest part. The key point is that this is not a technical issue but a product-market fit issue.

Jacob: What tools will you develop for language learning next?

Connor: We will be focusing on language learning for a while. We still have a lot of work to do. The next thing to be developed will be some tools related to this. We then need to carefully consider what investments to make in products for schools, individuals and businesses.
