Sarah: Today we are pleased to welcome Aidan Gomez, CEO of Cohere. Cohere, valued at over $5 billion in 2024, provides AI language models and enterprise solutions. Aidan co-founded Cohere in 2019, and before that he interned at Google Brain and was a co-author of the landmark 2017 paper "Attention Is All You Need."
Aidan Gomez: Great to be here!
Sarah: Maybe we can talk about your personal background. How did you go from growing up in Canada to co-authoring one of the most important technical papers in the world?
Aidan Gomez: A lot of it is luck and serendipity. I happened to be at the University of Toronto, where Professor Geoffrey Hinton taught, and he was simply a legend there; almost every computer science student hoped to go into AI. So in a sense, I feel I was "trained" in AI. As soon as I graduated, I entered an environment where I could really see the future and wanted to create it. From there it was a series of lucky coincidences.
I managed to get an internship at Google Brain, working with Lukasz Kaiser. Later I found out that the internship was originally only for doctoral students. At my farewell party as an intern, Lukasz asked me, "Aidan, how many years are left in your PhD?" I replied, "I'm going back to my junior year." He was stunned and said, "We don't hire undergraduate interns." So I think it was all a very lucky "mistake" that ended up getting me onto that team.
Sarah: So what made you decide to start Cohere?
Aidan Gomez: I've actually worked in a few different places. I worked with the Transformer team in Mountain View, then came back to the University of Toronto and worked with Professor Hinton, then went to Berlin to work with Jakob (another author of the Transformer paper), and then started my doctoral research in London.
At the same time, I was participating remotely in the Pathways project, a training platform bigger than a single supercomputer. The idea was to connect multiple supercomputers into a new, larger computing unit for training models. GPT-2 had just been released, and we could see the trajectory of the technology very clearly. Such a model was ostensibly just a model of the Internet or the web, but it would clearly enable some very interesting things. So I called Nick and other friends and said, "We should figure out how to build these things."
Cohere's mission and applications in the enterprise market

Sarah: Can you briefly describe Cohere's mission? Then let's talk about your models and products.
Aidan Gomez: Our mission is to create value by helping other organizations adopt this technology to make their employees more efficient or to transform their products and services. So we're very focused on the enterprise market. We're not trying to be a competitor to ChatGPT; we want to build a platform and a series of products that help enterprises adopt this technology and get value from it.
Sarah: To what extent do you think Cohere’s success relies on its core model, or how important is investment in platform building and marketing?
Aidan Gomez: Both are important. First, the model is the foundation: if a model can't meet customers' needs, nothing else follows. So the model is crucial; it's the core of the company. But in the enterprise world, customer support, reliability, and security are also key. So we've made significant investments in both.
Over the past 18 months, as more and more companies have begun using our models, we have observed what companies are trying to achieve and the common mistakes they make. These experiences are helpful, although sometimes frustrating, because we see the same mistakes happen over and over again. But there's a huge opportunity to help businesses avoid these mistakes and get things right the first time. So that's what we're working toward.
Sarah: Can you be more specific? For example, which mistakes frustrate you the most? How does your product solve these problems?
Aidan Gomez: First, the common mistakes companies make. All language models are very sensitive to prompting, that is, to the way data is presented to them. Each model has its own characteristics, and the way you talk to one model may not work with another.
So when building a RAG (Retrieval-Augmented Generation) system on top of an external database, how you present the retrieved results to the model matters a great deal. The way data is stored in the database is also critical, and so is the format. These details are often overlooked. Many people overestimate the capabilities of models and assume they are as intelligent as humans, which leads to many failures. People try to implement RAG systems without understanding the details of how to do it correctly, and they end up failing.
Our products follow two strategies. The first is to make the model more robust, so it adapts to different ways of presenting data. The second is to deliver it to users in a more structured way, not just as a raw model. For example, building more rigorous APIs that clearly specify how to use the model reduces the chance of failure and makes these systems more usable.
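As a concrete illustration of the formatting issue described here, the following is a minimal sketch of how retrieved documents can be rendered in one fixed, clearly delimited layout so every caller presents data to the model the same way. The field names, template, and example documents are illustrative assumptions, not any specific vendor's API.

```python
# Minimal sketch: format retrieved results consistently for a RAG prompt.
from dataclasses import dataclass
from typing import List

@dataclass
class RetrievedDoc:
    title: str
    snippet: str
    source: str

def format_grounded_prompt(question: str, docs: List[RetrievedDoc]) -> str:
    """Render retrieved results in one consistent, clearly delimited layout.

    A fixed layout (numbered documents, explicit fields, an explicit instruction
    to cite) is what a more structured RAG interface enforces for you, instead of
    leaving each caller to improvise a prompt the model may not handle well.
    """
    doc_block = "\n\n".join(
        f"[Document {i}]\nTitle: {d.title}\nSource: {d.source}\nContent: {d.snippet}"
        for i, d in enumerate(docs, start=1)
    )
    return (
        "Answer the question using only the documents below. "
        "Cite the document numbers you rely on.\n\n"
        f"{doc_block}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    docs = [
        RetrievedDoc("Pump X200 manual", "Error E4 indicates a blocked intake valve.", "manuals/x200.pdf"),
        RetrievedDoc("Diagnostics guide", "Clear E4 by flushing the intake and resetting the controller.", "guides/diag.pdf"),
    ]
    print(format_grounded_prompt("How do I resolve error E4 on the X200?", docs))
```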
Sarah: Can you give us an overview of use cases in the enterprise?
Aidan Gomez: The applications are very broad and cover almost every industry. One common use case is question answering, such as interacting with documents.
For example, a manufacturing company may want to build a Q&A bot for engineers or employees on the production line, integrating tool manuals, diagnostic manuals, parts manuals, and so on, so workers can chat with the bot to get information instead of flipping through thousands of pages to find answers. Similarly, companies build Q&A bots for regular employees, integrating IT FAQs, HR documents, and company information into a centralized chat interface so employees can get answers quickly.
Beyond that, a good example is the healthcare industry. Healthcare companies often have long-term health records for their patients. These records include all of a patient's interactions with the healthcare system, from trips to the pharmacy to checkups in various departments and doctor visits, sometimes spanning decades. It is a vast medical history. Typically, when a patient calls to make an appointment, they tell the receptionist, "My knee hurts and I need to make an appointment." The doctor then has to look through the past records to see whether there is anything similar from before, and they may miss something from two years ago because they only have 15 minutes to review the chart.
What we can do is feed the entire medical record into the system along with the reason for the patient's visit and generate a brief report based on that context. This not only significantly speeds up the doctor's review, it also surfaces key information the doctor could not find in such a short time. It's impossible for a doctor to look through 20 years of medical records before each consultation, but a model can, and it can do it in less than a second.
Sarah: How do you see the end state of the company? Of course there is no real "end state," but what do you think a stable equilibrium looks like? That is, how do enterprises choose between dedicated AI-driven application providers and custom applications built in-house on top of AI platforms and APIs?
Aidan Gomez: Ultimately it will be a hybrid model. You can think of it like a pyramid. At the base of the pyramid are things every organization needs, like a universal chatbot that answers questions for every employee. As you move up the pyramid, the needs become more and more specialized, targeting a specific product or service of the company or the industry it operates in. The further up you go, the less likely those needs are to be addressed by an off-the-shelf solution, so ultimately you have to build it yourself. Organizations end up adopting a strategy that spans the entire pyramid.
For example, we worked with an insurance company focused on large industrial development projects, a field I knew nothing about. What they actually do is this: when a mining company or other project issues a request for proposal (RFP), the insurance company sends an actuary to respond to the RFP and do a lot of research to understand the land in the area, the potential risks, and so on. It becomes a "race": whoever responds first is more likely to win the bid.
So the key is time: how quickly can these actuaries put together a well-researched proposal? We worked with them to build a tool like a research assistant, integrating all the knowledge sources actuaries commonly use through RAG, and delivered it as a chatbot. This greatly accelerated their responses to RFPs, helped them win more bids, and drove their business growth.
What we build is a horizontal technology, like a CPU. You can't know every use case because the space is so broad; the key to really providing deep insight and competitive advantage is listening to your customers and understanding what they need and what will let them get ahead. So a lot of our job is to be their thinking partners, helping them brainstorm and come up with projects and ideas that are strategically valuable to them.
Sarah: Generally speaking, what do you think is the biggest obstacle to enterprises adopting your technology?
Aidan Gomez: The biggest obstacle is trust, especially in regulated industries like finance, where security is a major issue. Medical data typically isn't stored in the cloud, or even if it is, it can't leave the customer's VPC (Virtual Private Cloud). So data management is very strict and extremely sensitive. Cohere's unique advantage is that we are not locked into a particular ecosystem: we can deploy flexibly on-premises, and inside or outside a VPC, whatever the customer needs. Because we can meet customers wherever their requirements are, we can reach more data, even the most sensitive data, and provide more valuable solutions. So security and privacy are probably the biggest issues.
There is also a knowledge gap. The knowledge needed to build these systems is new; even the most experienced people only have a few years of experience. But it's a matter of time. Developers will become more familiar with how to use these technologies, though it may take another two or three years before they are truly widespread.
Sarah: Will enterprise technology also go through the traditional "hype cycle"? For most technologies there is a "trough of disillusionment" phase: people have high hopes for a technology, then find it is harder to implement or more expensive than expected. Will AI go through the same process?
Aidan Gomez: Yes, we do see some of that. But honestly, the core technology is still progressing steadily, with new applications being unlocked every few months. So we haven't hit that trough of disillusionment yet; we're still in the very early stages. Even if we didn't train any new language models from today onward, there would still be a huge amount of enterprise adoption work to be done.
Some people used to ask, "Is this overhyped? Is the technology really useful?" But now it's in the hands of hundreds of millions of people, it's running in production environments, and as these technologies are delivered to the world, the value has become very clear.
Sarah: While we're talking about models and specialization, do you have a framework you use internally or with your customers to help them decide which version of the technology to invest in? For example, there are pre-training, post-training, fine-tuning, and retrieval. How do you tell customers to think about these techniques and apply them well?
Aidan Gomez: It depends on the application. For example, we worked with Fujitsu, Japan's largest systems integrator, to build a Japanese-language model. Without intervening at the pre-training stage, it's impossible to effectively add Japanese capabilities to a model, so in that case you have to start from scratch. For more specific needs, such as changing the model's tone or how it formats certain content, fine-tuning is enough, that is, starting from the final model state.
So there is a spectrum here, and we usually advise clients to start with the cheapest and easiest approach, which is fine-tuning, and work backward from there: fine-tune first, then move into the post-training stages such as SFT (supervised fine-tuning) and RLHF (reinforcement learning from human feedback).
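As a rough illustration of the cheapest step in that progression, here is a minimal supervised fine-tuning sketch using the Hugging Face transformers library. The base model, the prompt/response pairs, and the hyperparameters are placeholders for the example, not Cohere's actual tooling or data.

```python
# Minimal sketch: supervised fine-tuning of a small causal LM on a few prompt/response pairs.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical examples of the kind of tone/format change fine-tuning is suited for.
pairs = [
    ("Summarize the ticket for the support dashboard:",
     "Issue: login failure. Status: resolved."),
    ("Summarize the ticket for the support dashboard:",
     "Issue: billing mismatch. Status: escalated."),
]

def encode(prompt: str, response: str) -> dict:
    """Tokenize a prompt/response pair and build labels for the causal-LM loss."""
    text = f"{prompt} {response}{tokenizer.eos_token}"
    enc = tokenizer(text, truncation=True, max_length=64,
                    padding="max_length", return_tensors="pt")
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return {k: v.squeeze(0) for k, v in enc.items()}

dataset = [encode(p, r) for p, r in pairs]
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)  # the model shifts labels internally for next-token loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```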
Sarah: It does make sense to work your way up from the cheapest approach. Investing in pre-training is likely to be more controversial for any enterprise customer. Some experts will say no one should do this, that enterprises simply aren't competitive in the compute investment, the data scale, the data-curation workload, and the talent required for pre-training. What do you think?
Aidan Gomez: If you are a large enterprise with a large amount of data, say tens of billions of tokens, then pre-training really is a lever you can pull. For most SMEs and startups, pre-training makes no sense.
But if you are a large enterprise, it should be an option you seriously consider. The question is how much pre-training you need. It doesn't mean you have to start a $50 million training run from scratch; you can do a smaller run, say $5 million, similar to continued pre-training. That is a service we do provide.
Sarah: Let’s talk about the current state of the technology landscape and what that means for Cohere. You once mentioned, "There was no market for last year's model." What do you think of this perspective, especially compared to the rise of competing open source models?
Aidan Gomez: There is indeed a minimum spending threshold to build a useful model. As the technology develops, the compute required to train models gets cheaper, and data gets cheaper in some directions but harder and more expensive in others. For example, the cost of synthetic data has dropped significantly, while expert data is becoming harder and more expensive to obtain. If you're willing to wait six months or a year behind the frontier, you can build the technology at a much lower cost rather than paying the huge premium those cutting-edge labs pay.
This is also a key strategy for Cohere: instead of being the first to build the technology, find ways to dramatically reduce costs and focus on the parts that are truly valuable to customers, providing the enterprise market with products that fit their needs at reasonable prices.
At the same time, we still need to invest a lot of money. Compared with the average startup, we have to pay for supercomputers, which can run into hundreds of millions of dollars per year. So it's capital-intensive work, but it's not capital-inefficient.
The company's future development and trends in AI and AGI

Sarah: Maybe we can talk about predictions for the future. Where are you on the scaling-law curve? How much capability growth do you foresee over the next few years?
Aidan Gomez: We've come quite far and are now starting to get into the flatter part of the curve. We've moved past the stage where you can judge how smart a model is simply by interacting with it; so-called "feel tests" have gradually lost their usefulness. So what you need to do now is ask experts to evaluate the quality of these models in very specific fields such as physics, mathematics, chemistry, and biology, because ordinary people can no longer tell the difference between model outputs.
The technology still has a lot of room for improvement, but those improvements will mainly show up in specialized domains. For businesses and the routine tasks they want to automate, or the tools they want to build, the technology is good enough, or can get the job done with a little customization. So we're at a stage where there are some new unlocks, especially on the reasoning side. Inference-time reasoning has always been a shortcoming: models previously had no inherent, independent thought process. Now we're starting to see models that can reason. OpenAI was the first company to put this into production, but Cohere has also been working on it for a year.
Sarah: This is probably something that's underestimated across the ecosystem right now: shifting from a CapEx model to a consumption model to improve results. That's not to say they're completely different concepts, but when customers don't have to pay a large upfront cost for an expensive training process, or sit through delays, they'll be more willing to spend money on solving the problem.
Aidan Gomez: Yes, this hasn't fully sunk in yet; people haven't really evaluated the impact of inference-time compute on intelligence. There are implications even at the chip level, such as what kinds of chips to build and what to prioritize when building data centers. If we can spend compute at inference time, we don't need an architecture like a densely interconnected supercomputer; a lot can be done with distributed processing across nodes. It's a new paradigm that changes what these models can do, and how they do it.
Sarah: You just mentioned that ordinary people don't spend much time thinking about what "reasoning" is. Can you give us some intuition? For example, what types of problems does reasoning capability let models solve better?
Aidan Gomez: Any problem involving multiple steps benefits from reasoning. Some of these problems can be solved through memorization, which is what we currently make our models do. But something like solving a polynomial equation should be solved step by step, the way humans solve it. We've been training models to memorize input-output pairs and forcing reasoning behavior through techniques like chain of thought, but the real shift is that the next generation of models will be capable of reasoning from the start; it will be natural to them.
The models we trained in the past were trained on content from the Internet, and documents on the Internet are really the output of a reasoning process; the reasoning itself is implicit and unobservable. When a human writes an article, there are weeks of thinking, revising, and deleting behind it, and all of that reasoning is invisible. So it's understandable that the first generation of language models lacked that inner ability to "talk to themselves."
Now, through human data and synthetic data, we are deliberately collecting people's inner monologue, asking them to speak their thought processes aloud, transcribing them, and then training on that data so models imitate the problem-solving process. I'm really excited about this. Right now the technique is still inefficient and brittle, similar to early language models, but over the next two or three years it's going to become incredibly powerful and unlock a whole new class of problem-solving capabilities.
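To make the chain-of-thought idea mentioned above concrete, here is a small sketch contrasting a direct-answer prompt with one that includes a worked multi-step example. The wording, sample problem, and function names are illustrative, and the actual model call is omitted.

```python
# Minimal sketch: direct-answer prompting vs. chain-of-thought prompting.

COT_EXAMPLE = """Question: Solve for x: 2x + 6 = 14.
Reasoning: Subtract 6 from both sides to get 2x = 8. Divide both sides by 2 to get x = 4.
Answer: x = 4"""

def direct_prompt(question: str) -> str:
    """Prompt that asks only for the final answer (pure input-to-output mapping)."""
    return f"Question: {question}\nAnswer:"

def chain_of_thought_prompt(question: str) -> str:
    """Prompt that includes a worked example and asks for intermediate steps."""
    return (
        f"{COT_EXAMPLE}\n\n"
        f"Question: {question}\n"
        "Reasoning: Let's work through this step by step."
    )

if __name__ == "__main__":
    question = "Solve for x: 3x - 9 = 12."
    print(direct_prompt(question))
    print("---")
    print(chain_of_thought_prompt(question))
```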
Sarah: I still want to ask: What does Cohere think of AGI (artificial general intelligence)? Is this important to you?
Aidan Gomez: AGI means different things to different people, and I believe we will build truly intelligent machines. But the concept of AGI has become muddled. It is not a binary, discrete thing; it is a continuum.
Sarah: One definition in the industry is that even on a continuum you can set a breakpoint: the point at which the intelligence can replace any educated adult professional.
Aidan Gomez: That's like an objective checklist: when you tick all the boxes, you've met the standard. I think you can always find counterexamples; it's an ongoing process. What I don't agree with is the idea of AGI as a "Terminator," a superintelligence that self-improves and eventually wipes us all out.
We will be the ones who create abundance. Instead of waiting for some god to show up and do it for us, we can use the technology we are building to make it happen. If you're saying we will build AGI in the sense of very useful, general technology that can do many of the things humans can do and can be flexibly adapted to different domains, then my answer is yes. If you mean we will create a god-like being? No, absolutely not.
Sarah: Do you think current LLMs are simply unsuited to prediction in some areas? For example, can sequence-to-sequence models handle something like physical simulation?
Aidan Gomez: Probably yes, because physics is essentially a sequence of states and transition probabilities, so it may be modeled well by sequence modeling. That said, I'm sure there are domains where a better-suited model exists. If you go deep into a specific domain, you can exploit that domain's structure to strip out some of the unnecessary generality in the Transformer architecture and get a more efficient model.
There is genuinely irreducible uncertainty in the world, and building a better model can't help you with things that are truly random or unobservable. Until we learn how to observe those things, they will never be modeled effectively. The Transformer is a very general architecture: many things can be expressed as sequences, and these models are sequence models. So if you can describe something as a sequence, a Transformer can pick up the patterns very well. But I'm also sure there are cases where sequence modeling is very inefficient.
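As a toy illustration of the "states and transition probabilities" framing, the sketch below discretizes a made-up physical trajectory into tokens and estimates transition probabilities from counts. A simple Markov table stands in here for the sequence model, and the system and discretization are invented for the example.

```python
# Toy sketch: a physical trajectory written as a sequence of discrete state tokens.
from collections import Counter, defaultdict

# Discretized "states" of a made-up oscillating system: low, mid, high.
trajectory = ["low", "mid", "high", "mid", "low", "mid", "high", "mid", "low"]

# Count observed transitions state -> next_state.
transitions = defaultdict(Counter)
for current, nxt in zip(trajectory, trajectory[1:]):
    transitions[current][nxt] += 1

def transition_probs(state: str) -> dict:
    """Estimate P(next state | current state) from the observed counts."""
    counts = transitions[state]
    total = sum(counts.values())
    return {nxt: count / total for nxt, count in counts.items()}

for state in ("low", "mid", "high"):
    print(state, "->", transition_probs(state))
```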
Sarah: Last question. You mentioned inference-time compute earlier, and that the market doesn't really appreciate the scale of the change it brings. Are there other factors that aren't priced into the market right now?
Aidan Gomez: There is a misunderstanding about the commoditization of models. I don't think models are being commoditized. What you're seeing is just price competition: everyone is offering them for free, at a loss, at zero margin. So when you see prices drop, you assume that means commoditization.
In fact, the world is currently undergoing a complete technological overhaul, a process that will continue for the next 10 to 15 years. It's as if we had to re-pave every road on the planet, and right now only four or five companies know how to make concrete. Maybe today some of them give their concrete away for free, but over time fewer and fewer players will be able to do that. There's a huge task ahead, and market pressure to drive growth and show return on investment will push things in one direction or another. Operating at a loss, or giving away this very expensive technology for free, is a precarious status quo.
Sarah: Aidan, thank you so much for doing this interview with us!
Original: No Priors Ep. 91 | With Cohere Co-Founder and CEO Aidan Gomez
https://www.youtube.com/watch?v=2XRpTZpHjfc
Compiled by: Yueyun Xu