Okay, I’m spinning this off from the wtf science thread, and I’m tired of putting these in the Excel thread… (though seriously, I’m cross-posting this one in the Excel thread, too)
No data? JUST USE AUTOFILL!
Last year, a new study on green innovations and patents in 27 countries left one reader slack-jawed. The findings were no surprise. What was baffling was how the authors, two professors of economics in Europe, had pulled off the research in the first place.
The reader, a PhD student in economics, was working with the same data described in the paper. He knew they were riddled with holes – sometimes big ones: For several countries, observations for some of the variables the study tracked were completely absent. The authors made no mention of how they dealt with this problem. On the contrary, they wrote they had “balanced panel data,” which in economic parlance means a dataset with no gaps.
“I was dumbstruck for a week,” said the student, who requested anonymity for fear of harming his career. (His identity is known to Retraction Watch.)
The student wrote a polite email to the paper’s first author, Almas Heshmati, a professor of economics at Jönköping University in Sweden, asking how he dealt with the missing data.
In email correspondence seen by Retraction Watch and a follow-up Zoom call, Heshmati told the student he had used Excel’s autofill function to mend the data. He had marked anywhere from two to four observations before or after the missing values and dragged the selected cells down or up, depending on the case. The program then filled in the blanks. If the new numbers turned negative, Heshmati replaced them with the last positive value Excel had spit out.
The student was shocked. Replacing missing observations with substitute values – an operation known in statistics as imputation – is a common but controversial technique in economics that allows certain types of analyses to be carried out on incomplete data. Researchers have established methods for the practice; each comes with its own drawbacks that affect how the results are interpreted. As far as the student knew, Excel’s autofill function was not among these methods, especially not when applied in a haphazard way without clear justification.
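To make the concern concrete, here is a minimal sketch of what that drag-and-fill operation amounts to – not the authors’ actual procedure, just an approximation of Excel’s behavior, which roughly fits a linear trend to the selected cells and extends it. The negative-value fallback follows the article’s description; the choice of starting fallback value is my assumption.

```python
def autofill(selected, n_missing):
    """Roughly mimic dragging Excel's fill handle: fit a least-squares
    linear trend to the selected cells and extrapolate it over the gap.
    Negative extrapolations are replaced with the last positive value,
    per the article's description of the procedure."""
    k = len(selected)
    xbar = (k - 1) / 2
    ybar = sum(selected) / k
    slope = (sum((x - xbar) * (y - ybar) for x, y in enumerate(selected))
             / sum((x - xbar) ** 2 for x in range(k)))
    filled = []
    last_positive = selected[-1]  # assumed starting fallback
    for i in range(1, n_missing + 1):
        value = ybar + slope * (k - 1 + i - xbar)
        if value <= 0:
            value = last_positive  # negative -> reuse last positive value
        else:
            last_positive = value
        filled.append(value)
    return filled

# The fill is highly sensitive to how many adjacent cells are selected:
series = [1.0, 2.0, 4.0, 8.0]
print(autofill(series[-2:], 3))  # trend from last 2 cells: [12.0, 16.0, 20.0]
print(autofill(series, 3))       # trend from all 4 cells: ~[9.5, 11.8, 14.1]
```

The same gap gets very different values depending on how many cells are selected – exactly the kind of unconstrained flexibility that makes this unacceptable as an imputation method.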
Just to get it out of the way: Retraction Watch is going to be a heavy source for this thread (in that, that’s the original place I saw the links)
Top Harvard Medical School neuroscientist Khalid Shah allegedly falsified data and plagiarized images across 21 papers, data manipulation expert Elisabeth M. Bik said.
In an analysis shared with The Crimson, Bik alleged that Shah, the vice chair of research in the department of neurosurgery at Brigham and Women’s Hospital, presented images from other scientists’ research as his own original experimental data.
Though Bik alleged 44 instances of data falsification in papers spanning 2001 to 2023, she said the “most damning” concerns appeared in a 2022 paper by Shah and 32 other authors in Nature Communications, for which Shah was the corresponding author.
Shah is the latest prominent scientist to have his research face scrutiny by Bik, who has emerged as a leading figure among scientists concerned with research integrity.
She contributed to data falsification allegations against four top scientists at the Dana-Farber Cancer Institute — leading to the retraction of six and correction of 31 papers — and independently reviewed research misconduct allegations reported by the Stanford Daily against former Stanford president Marc T. Tessier-Lavigne, which played a part in his resignation last summer.
Bik said that after being notified of the allegations by one of Shah’s former colleagues, she used the AI software ImageTwin and reverse image searches to identify duplicates across papers. Bik said she plans on detailing the specific allegations in a forthcoming blog post.
In interviews, Matthew S. Schrag, an assistant professor of neurology at Vanderbilt University Medical Center, and Mike Rossner, the president of Image Data Integrity — who reviewed the allegations at The Crimson’s request — said they had merit and raised serious concerns about the integrity of the papers in question.
Shah did not respond to a request for comment for this article.
I don’t think that the Women’s Health Initiative was actually fraud, but there are some serious issues with it. I’m starting to think the folks who do medical studies need some actuaries to review their data.
This podcast is part of a larger discussion where physicians are trying to undo the damage it’s done.
Well, I’m not necessarily wanting the standard “Science WTFery,” which is just poorly done science – I know which one you’re talking about. I don’t think anybody was committing what I’d call science fraud there – they just over-interpreted the results of the particular study/studies in that case. I remember that one.
What I’m trying to pull into this one (and I’m going to copy some of the older pieces over from other threads) are what look like deliberate frauds: faking up data, plagiarizing the work, just outright deception in the research. Not simply sloppy research or people just doing it incorrectly.
The ones I’m really interested in aren’t from the obviously bullshit “sciences” like psychology/sociology (and I’m not going to apologize for that one). Those fields have so much shoddy work serving as their “foundations” that there’s a reason they have poor reputations to begin with.
I am more interested in frauds going on in medicine, chemistry, physics, engineering, etc. That is some scary shit in many ways.
On the “using Excel’s autofill for data imputation” paper
A paper on green innovation that drew sharp rebuke for using questionable and undisclosed methods to replace missing data will be retracted, its publisher told Retraction Watch.
Previous work by one of the authors, a professor of economics in Sweden, is also facing scrutiny, according to another publisher.
As we reported earlier this month, Almas Heshmati of Jönköping University mended a dataset full of gaps by liberally applying Excel’s autofill function and copying data between countries – operations other experts described as “horrendous” and “beyond concern.”
For example, a student in my introductory statistics class once surveyed 54 classmates and was disappointed that the P-value was 0.114. This student’s creative solution was to multiply the original data by three by assuming each survey response had been given by three people instead of one: “I assumed I originally picked a perfect random sample, and that if I were to poll 3 times as many people, my data would be greater in magnitude, but still distributed in the same way.” This ingenious solution reduced the P-value to 0.011, well below Fisher’s magic threshold.
Ingenious, yes. Sensible, no. If this procedure were legitimate, every researcher could multiply their data by whatever number is necessary to get a P-value below 0.05. The only valid way to get more data is, well, to get more data. This student should have surveyed more people instead of fabricating data.
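The arithmetic behind the ploy is easy to demonstrate: copying each observation k times multiplies the test statistic by roughly √k while adding zero information. A quick sketch using a one-proportion z-test – the survey numbers here are hypothetical, not the student’s actual data:

```python
from math import erf, sqrt

def one_prop_ztest(successes, n, p0=0.5):
    """Two-sided one-proportion z-test against null proportion p0."""
    phat = successes / n
    z = (phat - p0) / sqrt(p0 * (1 - p0) / n)
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tail via erf
    return z, p

# Hypothetical survey: 33 of 54 respondents say "yes".
z1, p1 = one_prop_ztest(33, 54)    # p ~ 0.10, not significant at 0.05

# "Triple" the data by pretending each respondent was three people.
z3, p3 = one_prop_ztest(99, 162)   # p ~ 0.005, now "significant"

# z has scaled by sqrt(3), but no new information was collected.
print(f"n=54:  z={z1:.3f}, p={p1:.3f}")
print(f"n=162: z={z3:.3f}, p={p3:.3f}")
```

The same multiplication works for any test statistic that grows with √n, which is why fabricating sample size is so effective – and so illegitimate.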
I was reminded of this student’s clever ploy when Frederik Joelving, a journalist with Retraction Watch, recently contacted me about a published paper written by two prominent economists, Almas Heshmati and Mike Tsionas, on green innovations in 27 countries during the years 1990 through 2018. Joelving had been contacted by a PhD student who had been working with the same data used by Heshmati and Tsionas. The student knew the data in the article had large gaps and was “dumbstruck” by the paper’s assertion these data came from a “balanced panel.” Panel data are cross-sectional data for, say, individuals, businesses, or countries at different points in time. A “balanced panel” has complete cross-section data at every point in time; an unbalanced panel has missing observations. This student knew firsthand there were lots of missing observations in these data.
That statement is striking for two reasons. First, far from being a “few” missing values, nearly 2,000 observations for the 19 variables that appear in their paper are missing (13% of the data set). Second, the flexibility of using two, three, or four adjacent values is concerning. Joelving played around with Excel’s autofill function and found that changing the number of adjacent units had a large effect on the estimates of missing values.
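For reference, whether a panel is balanced is trivially checkable in code: every cross-section unit needs an observation in every period. A toy sketch (the country codes and values here are invented, not from the paper’s dataset):

```python
def is_balanced(panel):
    """A panel {(unit, period): value} is balanced iff every unit
    has an observation for every period."""
    units = {u for u, _ in panel}
    periods = {t for _, t in panel}
    return all((u, t) in panel for u in units for t in periods)

# FIN has no 1991 observation, so this panel is unbalanced.
panel = {
    ("SWE", 1990): 3.1, ("SWE", 1991): 3.4,
    ("FIN", 1990): 2.2,
}
print(is_balanced(panel))  # False
```

With 27 countries, 29 years, and 19 variables, claiming a balanced panel is a strong, easily falsifiable statement – and nearly 2,000 missing observations flatly contradict it.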
A few years ago, the International Committee of Medical Journal Editors proposed that every paper published in the top journals should make its raw data available. That proposal was shot down because people were worried about their careers – that other researchers would take their data and use it to make breakthroughs before them. Sharing is the solution. You should have to make all the data available whenever you publish medical research.
It shouldn’t just be a situation re: medical research.
This was in response to indications of the data fraud in cancer research (and those papers are getting retracted). Let’s be serious – nobody much cares re: the “squishy studies” papers, because everybody figures those are bullshit to begin with.
But one finds out that data in cancer research is being faked? What?
But back to the “somebody else may make a breakthrough before I do!” worry – maybe they can make it standard that one has to credit the data originators on the papers (but then they’ll know it’s just the data, not my analysis! Yes, that is true). Maybe the data originators get right of first publication, or exclusive use for a certain amount of time in certain journals. Figure something out.