Artificial Intelligence Discussion

My guess is that you are using the same equipment (or at least the same type of equipment) on the same populations of fish, etc.

This is different from the x-ray example, in which (I assume) the aim is to have a single tool sold to everybody that works across all hospitals, on all patients.

In short, you want to train and test on the same populations.

If you will forgive me for summarizing your thoughts, it seems to me you are thinking like a scientist. You see a specialized (or bespoke) tool for your fairly specialized research, customized to your use, and imagine all sorts of other specialized research applications as well.

I agree there is great potential here. As another example, I saw an article the other day in which machine learning was used to extract a burnt, lost work on Plato. A model was specially trained to distinguish the (x-rayed) burnt ink from the unburnt ink.

I do not think this is what the AI companies are after. They want universal tools, used by everybody, with little training, because that is what brings in the huge investments to build the infrastructure needed.

These universal tools are going to have to "scale" efficiently enough to bring in the needed returns to justify those investments. Being able to double-check humans may not be enough. They may need to replace humans at many jobs.

I think that, in contrast with the huge potential for bespoke, scientific applications, it is very unclear whether these AI applications will have the universality needed to justify these investments, or the hype that is driving them.

I’m doing proof-of-concept work on a new method of aging fish. We’re still convincing ourselves that the method works well across instruments of the same model but different units.

You could generalize better if you trained the method using x-rays from multiple machines.

I heard one story where the AI figured out that the annotations on the images distinguished between positive and negative results. You definitely have to be careful with your training data.

That would be the hope given my profession.

A single instrument being used on a number of different populations of multiple fish species.

Actually, no. We’re trying to figure out how generalizable the method is, given that it’s very new to us. I know from past work in the same realm that there have been situations where there was more variability between instruments than from the feature being measured, and that you have to be careful making comparisons between research groups.
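
As a rough sketch of the kind of check I mean (the fish ages, instrument biases, and noise levels below are all invented, not real numbers from our work):

```python
# Toy sketch only: simulate each fish being read on every instrument, then
# ask how much of the spread comes from which instrument was used vs. which
# fish was measured. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_fish, n_instruments = 200, 4

true_age = rng.uniform(1, 15, size=n_fish)                 # the feature we care about
instrument_bias = rng.normal(0, 1.5, size=n_instruments)   # hypothetical unit-to-unit offsets
noise = rng.normal(0, 0.5, size=(n_fish, n_instruments))   # measurement noise

readings = true_age[:, None] + instrument_bias[None, :] + noise

# Crude variance decomposition: spread of instrument means vs. spread of fish means
between_instrument = readings.mean(axis=0).var(ddof=1)
between_fish = readings.mean(axis=1).var(ddof=1)

print(f"between-instrument variance: {between_instrument:.2f}")
print(f"between-fish variance:       {between_fish:.2f}")
```

If the first number is comparable to the second, comparisons across instruments (or across research groups) are on shaky ground without some kind of calibration.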

I’m not sure about this. If I were an x-ray manufacturer, I think I’d love to have proprietary software that’s designed to work specifically with my product line.

I think the field is too new to make the universal application. There are still lots of landmines to step on and lessons to be learned. The bespoke, proprietary software is achievable now. You take out a lot of the variability you have to deal with for a universal approach. You can also work with your client base to target the application to what they need and to build the training dataset.

I should have said it a little differently.

I know that, for example, medicine often works much less well in practice than in controlled, randomized experiments because there are all sorts of variables related to the people being treated, the way the medicine is administered, the kinds of pathologies being treated, etc.

These are additional variations that do not appear in controlled experiments. This seems to be a big deal when moving applications to the real world, to be used by non-scientists. It sounds like this is not really something you have to worry about.

I think there are far more sources of variation to be concerned about than the machine.

Another example: I remember reading years ago that a person should not do their own lead paint test. I bought a test, did it myself, and it seemed trivially easy. But I realized I am trained to follow directions carefully and exactly.

And yet, that is the claim being made.

Several AI executives have claimed we will have artificial general intelligence in 5 years.

These are the claims I am often thinking about when I talk about the AI "hype".

With medicine, it’s often because they simplify the study group so much to try to minimize the effect of between-individual variability. Let’s toss women from the study because their hormones complicate things. Restrict it to males between 30 and 50 with a BMI under 30 and no other issues. Then in the real world you go and give the medicine to people who are taking multiple drugs, women, old people, young people, etc.

With imaging software, you’re just looking at the image, and you can ignore a lot of those issues. You can (and should) stick everyone in the study.

I’d want to train the machine on that between-human variability, e.g. the training set should show femurs or mammograms of every possible shape and size, plus every possible location for a break or a tumour. You’d also want a good mix of broken and unbroken, or with and without tumours, across the size spectrum. Your training set is really important, and we’ve seen repeatedly that screwing it up messes up your model.
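
A minimal sketch of what I mean by covering the training data, using invented data and a stratified split from scikit-learn (the DataFrame, column names, and numbers are all made up):

```python
# Toy sketch only: build a train/test split that covers both the label
# (fracture / no fracture) and the size spectrum, so neither extreme of
# size is missing from training. Everything here is hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "femur_length_cm": rng.normal(45, 6, size=1000),  # hypothetical size measure
    "fracture": rng.integers(0, 2, size=1000),        # hypothetical label
})

# Bin size so stratification sees small/medium/large rather than raw floats
df["size_bin"] = pd.qcut(df["femur_length_cm"], q=3, labels=["small", "medium", "large"])
strata = df["fracture"].astype(str) + "_" + df["size_bin"].astype(str)

train, test = train_test_split(df, test_size=0.2, stratify=strata, random_state=0)
print(train.groupby(["size_bin", "fracture"], observed=True).size())
```

Stratifying the split is the easy part; the harder (and more important) part is making sure the data you collect in the first place actually fills every one of those cells.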

How long has Elon been promising full self-driving?

I’d hope by now that people would realize that the general news media is not the place to go to get good reporting on science and tech. Executives are also often bad sources of information on individual products. Their job seems to be to hype up investors rather than provide accurate info. Plus, their information is often a summarized version of a summarized version of a summary of the state of something, and the summaries are often written by people who are unfamiliar with the details.

The older I get, the more disillusioned I get with how orgs work. I feel like people like Trump and Musk and a lot of talking heads on TV have legitimized lying without consequences by people in power.

This is like patients going online and reading stuff about their disease and then telling their doctors that they are wrong. Who cares? The radiologist won’t care that some AI tool came up with a different diagnosis. If you want a second opinion, it will take another radiologist about 30 seconds to check if a scan is normal.

This is a good quote, and it applies to a lot of new technology. However, it’s also outdated. That was 2021, when practical AI was still basically brand new. A lot of work has since been done to reach production, including work by Stanford to generalize their chest X-ray model. To give you an idea, the FDA has approved a lot more AI devices since that quote than before it.

It would be interesting to see what he would say today. Generally speaking, I like Andrew, because he is one of the few experts who hasn’t bought the hype (as much).

But isn’t the issue that you cannot ever really include all sources of variability?

Experimental setups can remove variability, or add back individual sources in a controlled way. This is exactly what you want to do if you are trying to study the constants of nature.

How sensitive are the deep neural nets to this variability? It isn’t clear. However, we know that neural nets in general tend to be sensitive to outliers.

More generally, we do not really have any science about how this works. Nobody can say, for example, why a human child can learn to recognize a dog after seeing a handful of them, while computer vision models must see thousands, or hundreds of thousands. Instead, we have traditional models that engineers have approximated and scaled to work with very large data sets. And people are extrapolating the additional things that they can do.

I’ve given some thought to all of this, and as of now I believe one of the fundamental problems with these studies is that the loss functions they’re using are simply incompatible with real life. Without going into the details, the problem has to do with how the algorithms perform in the "tails", i.e. on inputs that are not well represented by the training sample. If you apply, say, a square loss function, it’s very easy to see that testing AI vs. radiologists will favor AI. But if you apply a more realistic loss function, where a misdiagnosis in the "tails" is severe, then it’s a completely different story.

It’s like the problem Andrew Ng described in 2021, whereby scans produced by older machines can’t be analyzed properly by AI. If you assume that such scans are relatively rare, a square loss will simply discard such errors as "meaningless". The ability of doctors to crush AI in the tails, combined with a realistic loss function that is quite punitive on misdiagnosis, can lead to completely different results. In fact, it’s possible that the central limit theorem doesn’t even apply, since the variance doesn’t exist due to the thickness of the tails.

So overall, I am wholly unconvinced by these studies. The same situation applies to applying AI to legal judgments.
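
To make that concrete with a toy example (all numbers below are invented, nothing here is from any actual study): the same two sets of errors can rank in opposite orders under a square loss versus a loss that makes rare, severe misses catastrophic.

```python
# Toy numbers only: the same two sets of diagnostic errors scored under a
# square loss vs. a loss where rare, severe misses carry a catastrophic cost.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical "AI": tighter than the human on typical cases, but far off
# on the ~2% of inputs it has not seen enough of.
ai_errors = rng.normal(0.0, 0.5, size=n)
tail = rng.random(n) < 0.02
ai_errors[tail] += rng.normal(0.0, 5.0, size=tail.sum())

# Hypothetical "radiologist": noisier on average, no blow-ups in the tails.
doc_errors = rng.normal(0.0, 1.0, size=n)

def squared_loss(e):
    return np.mean(e ** 2)

def tail_punitive_loss(e, threshold=3.0, penalty=1000.0):
    # A severe miss costs a flat, large penalty instead of just e**2
    return np.mean(np.where(np.abs(e) > threshold, penalty, e ** 2))

for name, e in [("AI", ai_errors), ("radiologist", doc_errors)]:
    print(f"{name:12s} squared: {squared_loss(e):5.2f}   tail-punitive: {tail_punitive_loss(e):7.2f}")
```

With these invented numbers the "AI" comes out ahead under squared loss and behind once tail misses carry a large flat penalty, which is the crux of the argument.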

I’m simply against AI in medicine, legal, etc. If you have catastrophic loss from classification errors, most of these studies comparing docs vs. AI or judges vs. AI are pretty useless. The breakdown of the algorithm on inputs far away from the training sample is the key danger. If you apply these algorithms to tasks that don’t have catastrophic loss (like death), then it’s really fine. Stuff like classifying spam in your email. What happens if a viagra ad pops up in your inbox? Nothing, really.

Many predictive models have problems with robustness to outliers.

And models not based on causal principles can have problems generalizing.

I think both of these general properties are related to what you are talking about.

However, that doesn’t mean these models won’t be useful in some circumstances.

What’s interesting to me is that an outlier to AI can easily be handled by human beings; in fact, it’s plausible that outliers to AI are even easier for human beings to handle than the typical input. So in essence my point is that doctors are probably way better than AI on inputs that AI hasn’t seen enough of.

There are many options.

Look, you can send me countless studies and articles, I’m simply not convinced that AI can be safely deployed in medicine. It’s just my opinion. The same applies to legal, self-driving, etc. Go ahead and use it to detect spam, I have no problem with that application.

I think it’s tricky.

These machine learning methods are new, and it remains to be seen how they should be used. I share a lot of your concerns. But I tend to think the stakes are high enough in much of medicine that if they help a person catch mistakes occasionally, it’s probably worth it. It doesn’t have to be better than people in that case, because it’s not replacing them. It just has to make different kinds of mistakes.

On the other hand, you have a long history of research, going back to, I don’t know, the 1940s and led by Paul Meehl, showing that statistical models (which he called actuarial models back then) outperform clinical judgement in psychology.

This research has been extended to beyond just psychology. Statistical models outperform people in a lot of applications.

These are more traditional models (think linear models) that are probably more robust than the latest machine learning models.
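
A small, invented illustration of the robustness point, using an ordinary least-squares fit versus scikit-learn’s HuberRegressor as a stand-in for a more outlier-resistant alternative:

```python
# Toy data only: a clean linear relationship with a few grossly corrupted
# labels at high-leverage points, fit with plain OLS and with a Huber loss.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=200)

outliers = np.argsort(X[:, 0])[-5:]   # corrupt the five largest-x points
y[outliers] += 50.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=500).fit(X, y)

print("true slope:  2.0")
print(f"OLS slope:   {ols.coef_[0]:.2f}")    # pulled upward by the corrupted points
print(f"Huber slope: {huber.coef_[0]:.2f}")  # stays much closer to the true slope
```

The Huber loss grows only linearly for large residuals, so a handful of wild points can’t dominate the fit the way they do under squared error.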

I still think the science is lacking with respect to learning as an activity. We do not understand it well.

This is a reasonable assumption, depending on the exact application. If you ask an LLM for a diagnosis (not a radiology device), it might be way off as opposed to a little off. Of course this isn’t beyond measurement; it can be captured by decent science.

This is reasonable too, also depending on the exact application. Now that Waymo is driving 1 million miles a week, it regularly encounters bizarre 'edge cases'. On the other hand, since humans regularly kill each other with cars, it takes a certain number of these catastrophic failings to make them less safe than human drivers.

Once they drive enough miles, the statistics of how many people they kill every day should apply, though I suppose you could argue that there are some unique failure cases, like if someone somehow hacked them all simultaneously.

Anyway, broadly speaking, I’d say your fears are legitimate, but they also have well-defined bounds.

Does it? I thought it was being tested only in certain specially selected areas.

They have been releasing it to wider and wider areas, and the cars now drive anywhere in San Francisco, for example.

And yes there are numerous examples of edge cases causing problems.

But if it’s driving in an environment that it has already learned really well, then that is not really the same problem as a fully autonomous self-driving car that can take the place of a human driver.

Depends on the human. If it can take the place of an SF taxi driver (and doesn’t require too much remote assistance), then I would say it is solving a big problem.