In the field of artificial intelligence (AI), Stanford University professor Li Feifei is known as the "Godmother of AI".
She played a major role in the deep learning revolution, working for years to create the ImageNet dataset and competition, which challenged AI systems to recognize 1,000 categories of objects and animals. In 2012, a neural network named AlexNet won the championship in the ImageNet competition, and its outstanding performance shocked the entire artificial intelligence research community.
Since then, neural networks have taken off, driven by the vast amounts of free training data available on the Internet and by GPUs that provide unprecedented computing power.
In the 13 years since ImageNet emerged, computer vision researchers have mastered object recognition and moved on to image and video generation. Li Feifei co-founded the Stanford Institute for Human-Centered Artificial Intelligence (HAI) and continues to advance the development of computer vision. Just this year, she launched World Labs, a startup that generates 3D scenes that users can explore. World Labs works to give artificial intelligence “spatial intelligence,” the ability to generate, reason within, and interact with a 3D world.
Yesterday, Li Feifei delivered a keynote speech titled "From Seeing to Doing: Ascending the Ladder of Visual Intelligence" at NeurIPS, the top artificial intelligence conference, explaining her vision for machine vision.
Keynote speech link:
https://neurips.cc/virtual/2024/invited-talk/101127
Before the speech, Li Feifei gave an exclusive interview to IEEE Spectrum senior editor Eliza Strickland. The content is as follows:
Eliza Strickland: Why was the title of the speech "Ascending the Ladder of Visual Intelligence"?
Li Feifei: I think, intuitively, intelligence has different levels of complexity and advancement. In my speech, what I wanted to express is that it is amazing what we have learned about visual intelligence in the past few decades, especially the decade or so of the deep learning revolution. Our technical capabilities are getting stronger and stronger. I was also inspired by Judea Pearl’s “Ladder of Causality”.
The speech also has a subtitle, "From Seeing to Doing". People don't appreciate this enough: for animals and AI agents alike, "seeing" is closely tied to interaction and "doing". This is different from language. Language is fundamentally a communication tool used to convey ideas. In my opinion, the two are very complementary, and both are profoundly intelligent modalities.
ES: You mean that we react instinctively to certain sights?
Li Feifei: I'm not just talking about instinct. If you look at the evolution of perception and the evolution of animal intelligence, there is a deep connection between the two. When you have more information, the power of evolution drives the development of abilities and intelligence. If you cannot perceive your environment, your relationship with the world is very passive; whether you eat or get eaten is a very passive matter. But once you can take cues from the environment through perception, the evolutionary pressure really increases, and that promotes the development of intelligence.
ES: Do you think this is how we create deeper machine intelligence?
Li Feifei: I don't know if "deep" is the adjective I want to use. I think we are creating more capabilities. I think it is becoming more and more sophisticated and more and more capable. Solving spatial intelligence is a fundamental and key step towards comprehensive intelligence. I am convinced of this.
ES: I have seen the demonstrations from World Labs. Why do you want to study spatial intelligence and build these 3D worlds?
Li Feifei: I think spatial intelligence is where visual intelligence is headed. If we want to combine visual intelligence with doing, there is a very simple and obvious fact: the world we live in is not flat. Our physical agents, whether robots or devices, will live in a 3D world. The virtual world is also becoming more and more 3D. If you talk to artists, game developers, designers, architects and doctors, even if they are working in a virtual world, most of what they work with is 3D. If you stop and recognize this simple yet profound fact, then there is no doubt that cracking the 3D intelligence problem is fundamental.
ES: I'm curious about how the scenes World Labs shows maintain object persistence and obey the laws of physics. That is an exciting development, because video generation tools such as Sora are still working on these things.
Li Feifei: Once you recognize the 3D nature of the world, many things happen naturally. For example, in a video we posted on social media, a basketball is dropped into a scene. Because it's 3D, you have this ability. If the scene were just 2D generated pixels, the basketball would have nowhere to go.
ES: Or, just like in Sora, it might appear somewhere but then disappear. What are the biggest technical challenges you face in trying to advance this technology?
Li Feifei: No one has solved this problem, right? It's very, very difficult. In the demo video from World Labs, you can see that we took a Van Gogh painting and generated the entire scene around it in a unified style: the art style, the lighting, even what kind of architecture the neighborhood would have. If you turned around and it turned into a skyscraper, it would be completely unconvincing. It had to be 3D. You have to navigate it. So it's not just pixels.
ES: Can you talk about the data you used to train it?
Li Feifei: A lot.
ES: Do you face technical challenges in terms of the computing power required?
Li Feifei: There is a huge demand for computing power. This is something the public sector cannot afford. That's part of the reason I'm excited to do this in the private sector. It is also part of the reason I have been pushing for public sector access to computing power, as my personal experience underscores the importance of innovating with adequate resources.
ES: It would be good if the public sector were empowered, since it is generally more motivated to pursue knowledge for its own sake and for the benefit of humanity.
Li Feifei: The discovery of knowledge requires the support of resources. In Galileo's time, it was the best telescopes that allowed astronomers to observe new celestial objects. It was Robert Hooke who realized that a magnifying glass could be improved into a microscope and discovered cells. Every time a new technological tool emerges, it helps in the search for knowledge. Now, in the era of artificial intelligence, technical tools involve computing power and data. For the public sector, we have to recognize this.
ES: Assuming that we can make artificial intelligence systems truly understand the 3D world, what will this bring us?
Li Feifei: It will unleash a lot of creativity and productivity for people. I would love to design my house in a more efficient way. I know that a lot of medical uses involve understanding a very specific 3D world, which is the human body. We always talk about a future where humans will create robots to help us, but for robots to navigate in a 3D world, they need spatial intelligence as part of their brains. We also talk about virtual worlds that will allow people to visit places, learn concepts, or be entertained. These all use 3D technology, especially hybrid technology, which we call AR. I want to walk through a park wearing a pair of glasses that tell me about the trees, the paths, and the clouds. I also want to learn different skills through spatial intelligence.
ES: What kind of skills?
Li Feifei: Let me give you a simple example. What should I do if I get a flat tire on the highway? Right now, I need to open a "how to change a tire" video. But it would be cool if I could put on the glasses, see what's going on with my car, and then be guided through the process. You could also think about cooking, or about carving: fun things to do.
ES: How far do you think we can go in this regard in our lifetimes?
Li Feifei: I think this will happen in our lifetime because the pace of technological progress is very fast. You've seen the changes that the last 10 years have brought. This is certainly a sign of what's to come.
Interview link: https://spectrum.ieee.org/fei-fei-li-world-labs
Original author: Eliza Strickland, senior editor of IEEE Spectrum, mainly reports on artificial intelligence, biomedical engineering and other topics.