José Hernández-Orallo, expert on artificial intelligence: “Human criteria cannot be used to evaluate artificial intelligence”
José Hernández-Orallo (Kennington, London, 51) won his first computer, at the age of 10, in a raffle. “It was lovely,” he recalls. “My brother was buying a computer encyclopedia in installments, and if you completed it you were entered in a raffle.” They won it. “We played, like any kid nowadays, but we also programmed, and we had complete control over the machine. It’s not the same now.” Today he holds a doctorate, is a professor at the Polytechnic University of Valencia and a global expert in the evaluation of artificial intelligence, and he led the letter published in the journal Science with 15 other researchers. They argue for the need to “rethink” the evaluation of AI tools in order to move toward more transparent models and to see what their true effectiveness is, what they can do and what they cannot.
Question. What do you think of Geoffrey Hinton’s decision to quit his job at Google so he can warn more freely about the dangers posed by artificial intelligence?
Answer. What Hinton says is perfectly reasonable, but I’m a little surprised he is saying it now, when we have been saying the same thing for so long at centers like the Centre for the Study of Existential Risk or the Leverhulme Centre for the Future of Intelligence [both at the University of Cambridge, and with which he is affiliated]. And I think he has said similar things before, perhaps not as clearly or as loudly. I am surprised that Hinton is only now realizing that artificial and natural systems are very different, and that what works for one (capability, evaluation, control, morality, etc.) does not have to work for the other, apart from the obvious differences in scale and multiplicity (they can replicate, communicate, and update faster than humans). But it is very welcome that such an important scientist is saying it like this, and now. The overlap in what we consider at stake is very high, though we may differ on priorities. For example, I don’t think producing false material (texts, images, or videos) is a big problem, because raising our suspicions and forcing us to compare sources is healthy. I am more worried about certain solutions to the “alignment problem” that allow some countries or political or religious groups to align AI with their own interests and ideology, or to censor AI systems in a certain direction. The word “alignment”, understood as a single alignment, reminds me of very dark times for humanity.
Q. How did you get into artificial intelligence?
A. There was another encyclopedia in the house, on human evolution. I was fascinated by intelligence, how it developed, and I wanted to understand it. I also read philosophy books. Putting it all together, I studied computer science, because it was what my brother had studied, even though artificial intelligence at the time was barely half a course. Later I did my thesis at the Department of Logic and Philosophy of Science at the University of Valencia, which had a program more oriented toward the philosophy of artificial intelligence. I did it out of passion, and I had no other options because we didn’t have the resources. It was a year in which I also managed to work on what I liked, write a book, and do my alternative civilian service. Sometimes you don’t choose; one thing leads to another, but in the end I devote myself to what I have always loved, which is understanding natural and artificial intelligence.
Q. What does evaluating artificial intelligence systems involve?
A. We know what bicycles or kitchen robots are for and what tasks they can perform, and they are evaluated from the point of view of quality. Until recently, AI systems followed this path. If they are meant to classify cats and dogs, what matters is that they classify cats and dogs as well as possible. They were task-oriented systems. If you know how to evaluate them, you know whether they are serving the task you want and how many mistakes they make. But this is very different from systems like GPT-4, which have cognitive capabilities.
Q. What are these systems like now?
A. A system is good if it works for you, if it meets your expectations, if it doesn’t surprise you negatively. These are general-purpose systems. You have to determine what they can do based on how you give them instructions. They are very good, but they are not human, yet people assume they will react like a person, and that is where the problems begin. They answer with great confidence and you think it is true. That does not mean humans always answer correctly, but we are used to measuring people, to seeing whether they are reliable or not, and these systems do not work with the intuitions we use with humans.
Q. And how can evaluation be improved for these general-purpose tools, which are capable of doing so many things?
A. Well, it is something that has been tried. It is called capability-oriented, rather than task-oriented, evaluation. There is a long tradition and a science behind this type of assessment, but many people have rushed to take the same tests used for humans and apply them to artificial intelligence, and they are not designed for machines. It is like using a wall thermometer to measure body temperature: it simply doesn’t work.
Q. But is there a way to evaluate AI by capabilities?
A. That is what we are trying to develop. For example, GPT-4 was evaluated with tests, especially educational ones: college entrance exams, chemistry, physics, language, a bit of everything. Taking the result it gets, comparing it with that of humans, and saying it is at the 70th percentile makes no sense. It may be an indicator, but it does not mean it is above 70% of people. When you apply these tests to humans, you assume many things, for example that they can bring you a coffee. Now ask the system to bring you a cup of coffee.
Q. So there is no way to evaluate it?
A. We cannot measure how they work task by task, because we would never finish. To evaluate a system like this, we need to extract indicators, in this case capabilities, that allow us to extrapolate how the system will behave in the future. That does not give you a single number. We should be able to compare humans and AI systems, but it is being done badly. It is a very complex problem, but I do not lose hope. We are where physics was in the fifteenth or sixteenth century: right now everything is very confusing. We need to break the old schemas, and the ultimate goal, in decades or centuries, is to arrive at a set of universal indicators that can be applied not only to humans and artificial intelligence, but also to other animals.
Q. Do you understand that people find it scary?
A. We are beings in the course of evolution, and we are only one type of intelligence among those that can exist. Sometimes we think we are transcendent, but we got here through many evolutionary accidents. The closest thing to us is a bonobo, and there is an important leap because we acquired language, and we think we are the peak of the natural scale, which we are not. With artificial intelligence we ask ourselves where we stand. The difference is that our evolution was given to us, and there is broad consensus that we do not play with it, that nobody sets out to create new species; but with AI we do play, and when you play with fire you can get burned. We have reached levels of sophistication where these games are no joke and must be taken seriously. It is extraordinary; it is like creating a new world.
Q. The authors of the letter propose a roadmap for AI models to present their results more accurately and to make evaluation results available case by case.
A. Yes, the level of scrutiny should be higher. In other cases, with the training data, the algorithm, and the code, I could reproduce the system myself, but with these systems it is impossible because of the computational and energy cost.
Q. But could they be more transparent?
A. You can be transparent about the process. What we ask is that they be more detailed with their results and allow access to the breakdown for each of the examples. If there are a million examples, I want the result for each of the million, because I do not have the capacity to reproduce that, and not just because I lack access to a computer. And that limits what is fundamental in science, which is peer review. We do not have access to the cases where it fails.
Q. Is regulation the solution?
A. It is necessary, but it must be done well. If it is not regulated, there will certainly be setbacks. If you do not regulate aviation, accidents happen, people lose confidence, and the industry does not take off. If something serious happens, society’s reaction may be to turn against these systems, and in the medium and long term they will have less adoption and use than they could have as tools that are, in general, positive for society. You have to regulate, but without braking too hard. People are afraid of flying, yet we know that aviation regulations are among the strictest, planes are one of the safest modes of transportation, and companies know that this is good for them in the long run.
Q. Can there be regulation for everyone, all over the world?
A. There are conventions like the IAEA’s and the ones on recombinant DNA. With GMO foods it has failed: countries do not agree, and in Europe we consume these foods but cannot produce them, and that is what could happen to us. The EU regulation may have flaws, but it has to be passed and put into practice.
Q. Do you think this regulation should be strict or lenient?
A. I think it should depend on size: strict with the big players and more lenient with the small ones. You cannot demand the same from Google as from a startup of four kids in college, because then you kill innovation.
Q. Has a gap opened once again between regulation and science?
A. Artificial intelligence is moving very fast, and there are things that cannot be anticipated. It is hard to regulate something so cross-cutting and so disruptive. We are slow, but we are also behind on social networks, and we have been dealing with tobacco forever.
Q. Would knowing how the black boxes work shed some light?
A. Opening the black boxes does not tell you what the system will do. To really know what it does, when it fails, and what expectations to have of it, a lot of evaluation is needed. To evaluate students we do not put them in a scanner; we give them a test. If we want to know how safe a car is, what helps me is not knowing how many spark plugs it has but knowing how many tests were done, whether they tested if it skids on a curve or not. That is why the question of evaluation is fundamental. What we want is to test these systems until we can delimit the zone in which they can be used safely. That is how cars and planes are evaluated.
Q. Why does artificial intelligence create such anxiety?
A. Efforts are being made to raise awareness, but the goal is not for people to understand how these systems work. The criticism of OpenAI is that it has given access to the most powerful AI system to hundreds of millions of people, including children and people with mental health problems, without taking responsibility, and that is the culture we have today: we download apps and nobody takes responsibility. I suppose they thought that if they did not get people to use it, how would they know the risks. But controlled experiments can be done. They say access is gradual, but that is commercial policy. It is a challenge to Google’s leadership in the search engine business. And people are afraid because a few players control everything; it is an oligopoly.