The Most Complete Map of AI Coding: How Agents Will Subvert Software
2024-12-17


Image source: Generated by Unbounded AI

Investment in the coding field is shaped by two core trends, the continued progress of the underlying inference engine and the expansion of the market space, so it deserves our attention. For the LLM as an inference engine, coding is the best application scenario: the logic of code is clearer than natural language, and execution results can be verified automatically. From Sonnet 3.5 to o1 pro, every improvement in model capability has shown up as an improvement in coding ability, and application progress in this field has been especially rapid.

The continuous advancement of capabilities leads us to expect more users beyond professional developers to participate in software production. This study maps the directions and companies in the AI coding field. The research framework covers not only products for professional developers (both enterprise and independent) but also products for citizen developers (beginners and general knowledge workers); the two groups of companies have different product focuses.

For professional developers, we expect coding to evolve from copilot to agent faster than other vertical directions, with the two coexisting. At this stage, copilots with excellent product experience, such as Cursor and Windsurf, have greatly improved the working experience and productivity of independent developers. Enterprises, with their huge codebases, complex engineering contexts, and strict security and compliance requirements, are difficult to serve in the short term, so agents will land first in specific use cases such as testing, review, and migration. Meanwhile, the general copilot IDE retains its value and becomes the working environment for the coding agent.

For citizen developers, we look forward to disruptive changes in how software is produced and the emergence of a new product form: the task engine. AI coding can satisfy the long-tail needs of more knowledge workers and generate disposable apps: next-generation software that can be used and thrown away. Each app does not need a large DAU; instead it offers a personalized experience to each user, and its content can even be generated in real time. Innovation at the UI/UX level is necessary: products with a lower interaction threshold let more users express personalized needs, forming a new-generation OS on top of the coding agent. By analogy with PC history, current products are still in the command-line era, and we look forward to the "GUI moment" of AI coding.

💡 Table of Contents 💡
01 Investment Thesis
02 State of AI Coding Landscape
03 Open Discussion

01.Investment Thesis

Investment in the coding field mainly faces two major changes: the continuous progress of the underlying inference engine and changes in the market space.

1) Coding is the fastest-evolving capability under the LLM + RL paradigm, and will be the first area to move gradually from copilot to agent.

Under the RL paradigm, an automated verification environment and a clear reward model are important prerequisites for improving reasoning capabilities. Coding is the most suitable scenario to meet this requirement, and it is likely to be the first scenario where LLM moves from copilot to agent. The release of o1 pro once again validates this thesis.

2) The task engine will expand the market from 50 million developers to 500 million knowledge workers, from professional to citizen developers.

There are only 50 million professional developers; 99% of people in the world cannot write code. Yet many people have personalized task requirements well suited to being standardized as software. In the past, the software industry's cost of trial and error was too high; an AI-powered task engine can generate large amounts of such software.

Imagine the next generation of the Internet: what users type into the browser's URL box is no longer a URL but a natural-language prompt, and the required content is generated and rendered in real time; each time the next-generation operating system opens, it serves the most suitable content based on the user's recent behavior and temporal context. This may be the "Google opportunity" of the AI era: a new entrance to the cyber world.

The above are the fundamental reasons we are optimistic about investment opportunities in AI coding. Next, we introduce the coordinate system used to classify the products emerging in the AI coding landscape:

• Y-axis: to what extent does the product require humans in the loop? Products toward the top complete tasks end to end; products toward the bottom focus on empowering developers;

• X-axis: how strong are the development skills of the product's target users? Products toward the left serve professional developers; products toward the right serve users with no programming background.

3) Copilot for pro & Agent for citizen are areas where model capabilities now match product requirements.

• Lower left quadrant, Copilot for pro dev:

Recently popular products such as Cursor and Codeium's Windsurf sit in this quadrant. They have earned an excellent reputation among early-adopter developers; both their deep understanding of user needs and Claude Sonnet 3.5's excellent intent understanding were indispensable here.

The key issue for their continued growth is that an excellent product experience wins over indie developers, but that is only a necessary condition for enterprise GTM, not a sufficient one. Between product and enterprise trust lie many complex requirements such as privacy and compliance. Meanwhile, GitHub Copilot, which has recently woken up and accelerated its iteration, will also squeeze their market space.

• Upper right quadrant, Agent for citizen dev:

The product form of the coding agent for ordinary users has not yet settled. Several front-end deployment and web-IDE unicorns in the coding field have launched their own products: Vercel V0 and bolt.new can intelligently generate front-end web apps, and Replit Agent can interact with users over multiple rounds to generate basic software. Early startups have more innovative product forms: Websim simulates real-time generation inside a Chrome-like browser, and Wordware uses Notion-like interactions to let users create software within its product.

But the opportunities in this quadrant may have only just begun, because current products are still stuck in previous paradigms. It is comparable to the command-line era before Xerox PARC's GUI innovation: the threshold for mass users remains very high, so usage stays within the early-adopter circle.

• Upper left quadrant, Agent for pro dev:

To realize this vision, current model capabilities must continue to improve. More than five companies in Europe and the US, each with over $100M in financing, are working in this field, because code-token consumption is very large and many engineering problems remain to be solved. The most critical one is enterprise codebase context:

On the one hand, accurate retrieval from a huge codebase is a hard problem. Large technology companies are full of legacy code left behind by engineers who have long since departed, making projects difficult to understand. AI can in theory use a longer context window, but current understanding and search accuracy are not yet sufficient. On the other hand, enterprise codebases encode a large amount of internal business logic, which requires fine-tuning on proprietary data and even on-prem deployment. Large enterprises like Morgan Stanley or Coca-Cola employ no fewer developers than Google and Meta, and their requirements for compliance and privacy outweigh the technology itself.

• Lower right quadrant, copilot for citizen dev:

There are already relatively mature solutions in this field, so this quadrant does not appear in the company mapping below. Previous generations of low-code/RPA produced many successful products, including listed companies such as UiPath and unicorns such as Retool. But they all remain at the copilot stage, and their abstractions can only assist with users' fixed workflows.

On the contrary, Excel has become the best no-code product, helping most knowledge workers accomplish many calculation and statistics tasks. This is an interesting historical lesson. The "Excel" that this generation of AI coding products faces is ChatGPT, a product with 500 million MAU. How to get around its user base and strongest models is a problem startup teams need to keep thinking about and iterating on.

02.State of AI coding landscape

Following the thesis above, we have mapped the startups across the AI programming landscape:

• Copilot for pro: by development workflow, this divides into coding, testing, code review, and code search. The core value is still concentrated in the coding entry point.

• Agent for pro: there are two types of companies here, coding agent companies and coding model companies. The biggest difference between them is whether the model is developed from scratch: the former build workflows and agents on top of frontier LLMs, while coding model companies train coding-specific models from scratch. We are not optimistic about the latter category, because it sits on the main path of the frontier LLM companies.

• Agent for citizens: companies in this field have not yet converged, and we divide them into three categories. The first is the task engine: companies that generate working prototypes so users can complete tasks. The second is front-end web page generation. The third is low-code products that assemble applications from "Lego" components. In the end everyone's goal may be the task engine, but for now each has chosen a different route to bet on.

Copilot for pro

• Coding representative companies: Anysphere (Cursor), Codeium, Augment

Product

Products focused on the programming experience fall into two categories: standalone IDEs and VSCode extensions. Each has its advantages: building your own IDE gives complete product freedom and accumulates user data, while a VSCode extension is more agile and has lower user migration costs.

The Cursor team made a smart choice here, gaining the advantages of both options by forking VSCode. Codeium is also moving toward an IDE with Windsurf, because the IDE remains the better entry product: it can accumulate data on its own and has more room for feature changes, which is crucial for building the product's own moat.

Cursor spends a lot of effort on user experience to achieve "fast" next-action prediction. The user's development loop becomes pressing Tab repeatedly, a positive cycle of rapid feedback (fast = fun, entering flow). Its acquisition of Supermaven last month was meant to maximize this "fastness"; it also means the short-term focus is still synchronous human-AI collaboration, and asynchronous interaction like o1's is not yet on the main product line.

The Codeium team went from VSCode extension to IDE, which reflects a different philosophy from Cursor's. Cursor emphasizes programming experience and recognizing the user's next intent, while Codeium's new product Windsurf emphasizes high automation. Its chat function is more complete than Cursor's, and many users can complete basic development without writing code by hand.

At the same time, their products also reflect a stronger understanding of enterprise-level needs, supporting on-prem proprietary models and various compliance protocols. Here we should mention the huge difference in their GTM strategies.

Market

According to the latest report from Sacra, Cursor's ARR has reached $65M, roughly 300,000 paying users. Because Cursor's product does not focus on enterprise codebases, its core users are still Silicon Valley indie hackers. The key bet for its future is whether indie hackers will grow as a share of all developers: if, under the AI-product development paradigm, independent developers reach 5 million, equal to 10% of all developers today, Cursor's market space could reach a billion dollars.

The growth patterns of enterprise BD and the developer market are different, and Codeium is strong at signing enterprise deals. What enterprises require is not the smoothest product experience but the directions they care about, such as security and compliance. In an exclusive interview with Latent Space, Anshul proposed the concept of "enterprise-infra native", emphasizing that serving Fortune 500 users requires breaking out of the Silicon Valley developer mindset:

• Security: needs to support multiple deployment options such as self-hosted or hybrid; containerized deployment (Docker, Kubernetes) is key to ensuring data isolation of customer environments.

• Compliance: Enterprises are highly sensitive to the training data used by LLM and need to prove that no copyrighted or unauthorized data is used; data cleaning and data source tracking ensure compliance.

• Personalization: data quality directly determines the effect of personalization, and the timeliness and relevance of the data must be evaluated, helping companies write higher-quality code through fine-tuning/RAG. Data preprocessing and role-based access control (RBAC) are key to avoiding data leakage from information aggregation.

• ROI analysis: The ROI of generative AI is difficult to quantify. By providing usage data of sub-teams, we help customers optimize usage effects and prove value.

• Scale: The enterprise environment is complex and large-scale (such as tens of thousands of code bases, tens of thousands of developers), and needs to solve the problems of large-scale indexing and latency management. The system design needs to be efficient and stable under the conditions of high user volume and high data volume.

The opportunities on the enterprise side may be clearer low-hanging fruits, but the competition they face is Github Copilot's extremely strong distribution channel. When facing competition, it is crucial to use research to address areas where Github may not be doing well.

Research

These coding companies cannot be regarded merely as application-layer product companies; they integrate research and product. Cursor's official website calls itself an applied research lab, Codeium's blog makes similar claims, and Augment has also explored directions such as retrieval and RL.

Augment and Codeium care more about enterprise-level technical solutions, especially problems that Github Copilot currently cannot solve well. For example, Augment is solving the problem of how to accurately perform retrieval and interactive understanding in tens of thousands of enterprise codebases. Similar to the problems encountered by enterprise document RAG, codebase retrieval requires retraining a dedicated embedding model, and the embedding required for dialogue, completion, and cross-file generation are different. Codeium also thinks a lot about deploying dedicated coding models on enterprise on-prem/VPC to achieve a balance between security and intelligence.
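As a minimal illustration of the retrieval problem described above, the sketch below ranks code chunks against a natural-language query by cosine similarity. A bag-of-words vector stands in for a trained code-embedding model; the chunks and the `embed` function are illustrative assumptions, not any company's implementation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a trained code-embedding model: a bag-of-words vector.
    return Counter(text.lower().replace("(", " ").replace(")", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical codebase chunks; a real system would index millions of these.
chunks = {
    "auth.py": "def check_password(user, password): return hash(password) == user.pw_hash",
    "billing.py": "def charge_invoice(invoice): gateway.charge(invoice.total)",
}

def retrieve(query: str, top_k: int = 1) -> list:
    q = embed(query)
    ranked = sorted(chunks, key=lambda name: cosine(q, embed(chunks[name])), reverse=True)
    return ranked[:top_k]

print(retrieve("where is the password check"))  # ['auth.py']
```

A production system would also need separate embeddings for dialogue, completion, and cross-file generation, as the paragraph above notes; this toy version only shows the ranking step.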

For Cursor, asynchronous collaboration with strong reasoning capabilities is the core direction of research, which corresponds to the internal project shadow workspace. The shadow workspace is a development space designed by Cursor for the backend coding agent. This space needs to be able to see the lint prompt information caused by agent modifications, and fully interact with the LSP protocol behind the IDE, but it does not modify the user's original files. AI and users will work together to decide whether to proceed with the next iteration based on the Lint feedback under the shadow workspace. This process is similar to o1 inference time compute.

Shadow Workspace early architecture diagram
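The loop above can be sketched as follows. The `lint` and `propose_fix` functions here are toy stand-ins, not Cursor's actual implementation; they only show the shape of edit → lint → iterate on a shadow copy that never touches the user's file:

```python
def lint(source: str) -> list:
    # Toy linter: flags lines that reference the undefined name `pritn`.
    return [f"line {i}: undefined name 'pritn'"
            for i, line in enumerate(source.splitlines(), 1) if "pritn" in line]

def propose_fix(source: str, diagnostics: list) -> str:
    # Stand-in for the LLM: repair the typo the linter reported.
    return source.replace("pritn", "print")

def shadow_iterate(original: str, max_rounds: int = 3) -> str:
    # Work on an in-memory copy; the user's original file is never modified.
    shadow = original
    for _ in range(max_rounds):
        diagnostics = lint(shadow)
        if not diagnostics:
            break  # clean: surface the result to the user for approval
        shadow = propose_fix(shadow, diagnostics)
    return shadow

print(shadow_iterate("pritn('hello')"))  # print('hello')
```

In the real shadow workspace the diagnostics come from the IDE's LSP integration rather than a string match, and the decision to keep iterating is shared between the AI and the user.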

• Testing representative companies: QA Wolf, Momentic, Gru AI

Code testing is a workflow every developer must go through to ensure correctness. Two kinds are common: unit tests, which in complex systems are needed after code updates to verify availability and reduce the probability of unexpected crashes; and, in front-end or application development, interactive testing of each UI function.

Test tasks are also highly compatible with codegen: the writing process is repetitive and regular, and it is work human engineers dislike. The low unit-test coverage of most teams illustrates the point.
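Because test writing is repetitive and rule-like, generated tests tend to follow a mechanical table-driven pattern. A sketch of what such output looks like; the `slugify` function under test and its cases are hypothetical examples, not from any real tool:

```python
def slugify(title: str) -> str:
    # Hypothetical function under test.
    return "-".join(title.lower().split())

# The kind of input/expected-output table a codegen tool can emit mechanically:
CASES = [
    ("Hello World", "hello-world"),
    ("  spaced   out  ", "spaced-out"),
    ("already-slugged", "already-slugged"),
]

def test_slugify():
    for raw, expected in CASES:
        got = slugify(raw)
        assert got == expected, f"slugify({raw!r}) = {got!r}, want {expected!r}"

test_slugify()
print("all cases passed")
```

The repetitive shape (enumerate cases, run, compare) is exactly what makes this task a better fit for LLMs than open-ended debugging.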

At the same time, the Cursor team mentioned in an interview that debugging remains quite difficult for base LLMs: the way LLMs are trained makes it hard for them to grasp the significant downstream impact of a seemingly minor error. So there are opportunities here for independent startups.

Among the representative companies in this field: QA Wolf predates LLMs and implements many test cases with rule-based methods; Momentic, a recent project out of YC and AI Grant, prefers human-AI collaboration for testing the visual product UI; Gru AI designs a dedicated agent for the unit-test scenario to meet enterprise testing needs end to end.

Momentic feature from homepage

• Code Review & Refactor Representative company: CodeRabbit

Code review & refactoring is important quality-assurance work for developers. Whether in an enterprise or as an independent developer, you need to take time to review pull requests inside and outside your organization. According to TechCrunch, 50% of enterprise developers spend 5 hours a week on code-review-related work.

CodeRabbit, the representative company here, has reached $100M+ ARR in less than a year. It is the most installed AI app on GitHub and GitLab and has reviewed more than 3 million PRs, indicating that LLM-native products can already provide good service in this field.

At the same time, larger CI/CD tasks can be classified as code refactoring: optimizing and restructuring code projects to pay down an organization's technical debt, sometimes even requiring architecture changes; refactoring plus migration is code migration. Such demands are onerous, and human engineers are reluctant to take them on, so this became the earliest scenario where the coding agent companies introduced next found PMF.

Agent for pro

• Coding agent representative companies: Cognition (Devin), Factory

The average financing size of coding agent companies is the largest, because these companies aim to replace human developers end to end. Implementing this requires a large amount of engineering and consumes a large number of code tokens. Two problems remain to be solved in this field:

1) Technically, the model's underlying reasoning capability is insufficient. Fully solving problems in a large enterprise codebase requires strong reasoning to understand the context on both the user's side and the codebase's side, and then decompose the task into multiple solution steps. Only such long-context + long-horizon reasoning can truly solve complex enterprise engineering problems.

2) In terms of product, the UI/UX layer needs innovation in how the agent collaborates with humans. Since model capability has not reached full usability, how to bring the human into the loop is a difficult question: when the model hits a hard problem, should it spend inference-time compute on search, or hand off to the user for more guidance and context? If this is not solved well, the result may be an AI that works for 12 hours and ends up stuck, with the user unable to make corrections on top of its work.

Due to the above problems, we speculate that the scenarios with actual PMF today are tasks such as code migration, code refactoring, and PR commits. These tasks are toil for developers: things they are unwilling to do, and offloading them frees developers for more creative work. So the coding agent currently does work from 1 to 100, not yet from 0 to 1. We are optimistic that coding agents will take on more responsibility over the next two years, but that requires joint progress in the underlying model and the upper-layer agent framework.

Pricing is also worth thinking about: traditional dev tools are generally priced per seat. For coding agents, consumption-based pricing may be more reasonable; an excellent coding agent that completes a large volume of tasks may be worth the same order of magnitude as a junior developer.
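A back-of-the-envelope sketch of consumption-based pricing; every number here is an illustrative assumption, not any vendor's actual rate:

```python
# Illustrative assumptions, not real vendor pricing.
PRICE_PER_M_TOKENS = 10.0      # $ per million tokens consumed by the agent
TOKENS_PER_TASK = 2_000_000    # a large refactoring task may burn many tokens
TASKS_PER_MONTH = 100

def monthly_agent_cost() -> float:
    tokens = TOKENS_PER_TASK * TASKS_PER_MONTH
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

print(f"${monthly_agent_cost():,.0f}/month")  # $2,000/month
```

Under these assumptions the monthly bill is $2,000, the kind of figure that gets weighed directly against a junior developer's cost, which is why consumption-based pricing maps more naturally onto agents than seats do.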

• Coding Model representative companies: Poolside, Magic

The model requirements of the coding field do not fully coincide with mainstream LLMs. For example, a code tokenizer must segment variables, symbols, and function names specially, while mainstream LLM training takes text generation as the main objective, with coding ability seemingly a by-product of general intelligence. Hence some independent coding model companies have appeared:

• Magic emphasizes an extremely long context window that can read an enterprise's complex codebase in full to solve problems, trying to avoid retrieval;

• Poolside emphasizes RL from machine feedback, learning complete solutions from the complex engineering chains in Git history.

However, companies in this field are on the main track of OpenAI and Anthropic. Given that coding ability is the best proxy for model reasoning capability, those two companies' models will certainly keep improving at coding under the LLM + RL paradigm.

Agent for citizens

• Task engine representative products: Replit, Websim, Wordware

The task engine parallels the search engine: what users get is no longer web pages retrieved by keyword, but software generated from their needs. We call it a task engine to soften the high-threshold connotations of the words "software" and "code". The killer app enabled by coding capability should be something the majority of users have both the willingness and the ability to use.

Anthropic's Artifacts and OpenAI's Canvas also aim at this goal to some extent, but their product forms are not very usable, and users still expect the main product to be a chatbot. So the current task engine remains in its command-line era; a GUI-level product innovation is needed before more users can understand and use it.

The current forms of this type of product actually differ widely:

• Replit Agent: a cloud IDE for the coding agent. The product interacts through multi-round chat; each action proceeds step by step, and when problems arise it asks the user questions to supplement context and clarify requirements, much like a developer and a product manager discussing requirements. This design uses alignment to solve reliability during the model's multi-step execution, but it also requires users to think their own needs through clearly.

• Wordware: Notion for LLM apps. The product has a high degree of polish, and the user experience feels more like creating content. It found a good first shot of viral growth through Twitter mocking bots; using Twitter as a growth starting point recalls Perplexity last year. After becoming the fastest-growing ProductHunt product, however, its traffic began to decline rapidly. Heavy reliance on head traffic products is Wordware's current challenge.

• Websim: using a simple UI similar to Google Chrome, it lets users create and consume web apps at the same time. The product leaves a lot of room for imagination: users can keep generating and modifying based on other users' templates, a bit like Canva's template idea, and every hyperlink on a website created in Websim can be clicked to generate a new website in depth. However, the design details still need polish; like C.ai, it has a good framework but is not yet extreme enough as a product.

• Representative companies for front-end generation: Vercel (V0), Stackblitz (Bolt.new)

Companies in this field already have deep accumulation in front-end frameworks and deployment. Vercel, for example, invented the Next.js framework; its main business is front-end website deployment, and its ARR has exceeded $100M (we have researched it in detail before). Its V0 product keeps improving, and both the aesthetic style and the conversational editing experience are much better than at first release. Another popular product is bolt.new, which can also turn a one-sentence product requirement into a good web app; its progress is faster still, with the same prompt showing quality improvements every two weeks.

The output of this type of product is close to usable, but problems emerge in sustained real use. The generated web app demos are very good, but because of the complicated technology stack, the generated product becomes hard to maintain and manage once it really starts to scale. So the future use case of front-end generation is likely the disposable application: a page generated to serve a temporary, long-tail need, with no need to be maintained for a larger group or a longer period.

The front end is a relatively easy area to start developing in, so every previous generation of low-code/no-code has told a front-end democratization story, giving birth to companies like WordPress, Wix, Squarespace, and even Shopify. The overall market is large, but demand is fragmented and market-leader concentration is low. AI front-end generation looks promising right now, but can it capture substantial incremental demand, or even replace existing demand? That directly determines the ceiling of its market size.

03.Open Discussion

1) The democratization of coding capabilities? No, it is the democratization of software engineering.

Technology democratizing a field is a story that has played out many times; Canva, for example, became a one-stop template and design platform for designers. Will the same democratization story happen in development?

We think coding ability will not be democratized, but software engineering will. This means the manufacturing cost of software will drop significantly with AI coding, but users do not need to understand how code is programmed and executed; they only need to understand the high-level run logic. In other words, users do not need to be developers, but they do need to be the product managers of their own needs.

2) UI/UX: synchronous and asynchronous; the GUI moment is coming.

The future development experience may combine synchronous and asynchronous modes: in the synchronous part, developers write code while AI does code testing/review/optimization in the background; in the asynchronous part, inference-time compute under the o1 paradigm can decompose a coding task into multiple sub-tasks, reason out the most appropriate solution, and verify it itself.
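The asynchronous part can be sketched as a decompose → solve → verify loop. The planner, solver, and verifier below are all toy stand-ins for the model and its automated checks (tests, lint), not any product's implementation:

```python
def decompose(task: str) -> list:
    # Stand-in planner: split a task into ordered sub-tasks.
    return [f"{task}: step {i}" for i in (1, 2, 3)]

def solve(subtask: str) -> str:
    return subtask.upper()          # stand-in for the model's solution

def verify(solution: str) -> bool:
    return solution.isupper()       # stand-in for automated checks (tests, lint)

def run_async_agent(task: str) -> list:
    results = []
    for sub in decompose(task):
        attempt = solve(sub)
        if verify(attempt):         # only verified work is synced back to the user
            results.append(attempt)
    return results

print(len(run_async_agent("migrate billing module")))  # 3
```

The key property is that verification is automatic at every step, which is exactly why coding is the scenario where this asynchronous loop can run without human interaction.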

The same will be true for AI coding in a broader sense: much software will be generated in real time based on context, and truly complex tasks will not require interaction at all; AI can complete them asynchronously and sync the results to users through email and other channels.

The interaction threshold of current products is still high, a command-line moment before the GUI arrives. When the new interaction arrives, the space for AI applications will open up, and the coding field may be the first place this is verified and felt.
