Research fraud (plagiarism, data faking, etc)

Okay, I’m sprouting this off the wtf science thread, and I’m tired of putting these in the Excel thread… (though seriously, I’m cross-posting this in the Excel thread, too)

No data? JUST USE AUTOFILL!

Last year, a new study on green innovations and patents in 27 countries left one reader slack-jawed. The findings were no surprise. What was baffling was how the authors, two professors of economics in Europe, had pulled off the research in the first place.

The reader, a PhD student in economics, was working with the same data described in the paper. He knew they were riddled with holes – sometimes big ones: For several countries, observations for some of the variables the study tracked were completely absent. The authors made no mention of how they dealt with this problem. On the contrary, they wrote they had “balanced panel data,” which in economic parlance means a dataset with no gaps.

“I was dumbstruck for a week,” said the student, who requested anonymity for fear of harming his career. (His identity is known to Retraction Watch.)

The student wrote a polite email to the paper’s first author, Almas Heshmati, a professor of economics at Jönköping University in Sweden, asking how he dealt with the missing data.

In email correspondence seen by Retraction Watch and a follow-up Zoom call, Heshmati told the student he had used Excel’s autofill function to mend the data. He had marked anywhere from two to four observations before or after the missing values and dragged the selected cells down or up, depending on the case. The program then filled in the blanks. If the new numbers turned negative, Heshmati replaced them with the last positive value Excel had spit out.

The student was shocked. Replacing missing observations with substitute values – an operation known in statistics as imputation – is a common but controversial technique in economics that allows certain types of analyses to be carried out on incomplete data. Researchers have established methods for the practice; each comes with its own drawbacks that affect how the results are interpreted. As far as the student knew, Excel’s autofill function was not among these methods, especially not when applied in a haphazard way without clear justification.

This is not a proper way to do imputation.
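
For contrast, here's a minimal sketch (Python/pandas, nothing to do with the actual paper; the series is invented) of what the described drag-to-fill procedure roughly amounts to, next to a couple of textbook imputation options:

```python
# Rough sketch, assuming Excel's multi-cell autofill acts like a linear trend
# fill; the series below is invented purely for illustration. Assumes at least
# two observed cells precede each gap.
import numpy as np
import pandas as pd

s = pd.Series([5.0, 4.0, 2.5, 1.5, np.nan, np.nan, 3.0, 4.5],
              index=range(2000, 2008), name="some_indicator")

def autofill_style(series, n_anchor=4):
    """Fit a line to the last few observed cells, extrapolate into the gap,
    and clamp negative results to the last positive value (as described above)."""
    filled = series.copy()
    for i in range(len(filled)):
        if np.isnan(filled.iloc[i]):
            anchor = filled.iloc[max(0, i - n_anchor):i].dropna()
            slope, intercept = np.polyfit(range(len(anchor)), anchor.values, 1)
            guess = intercept + slope * len(anchor)
            filled.iloc[i] = guess if guess > 0 else anchor.iloc[-1]
    return filled

print(pd.DataFrame({
    "raw": s,
    "autofill_style": autofill_style(s),              # ad hoc extrapolation
    "linear_interp": s.interpolate(method="linear"),  # uses both sides of the gap
    "mean_imputed": s.fillna(s.mean()),               # ignores the time trend
}))
```

Even the two "standard" columns here are simplistic; real imputation methods come with documented assumptions and caveats, which is the kind of disclosure the critics say was missing.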

5 Likes

Just to get it out of the way: Retraction Watch is going to be a heavy source for this thread (in that, that’s the original place I saw the links)

Top Harvard Medical School neuroscientist Khalid Shah allegedly falsified data and plagiarized images across 21 papers, data manipulation expert Elisabeth M. Bik said.

In an analysis shared with The Crimson, Bik alleged that Shah, the vice chair of research in the department of neurosurgery at Brigham and Women’s Hospital, presented images from other scientists’ research as his own original experimental data.

Though Bik alleged 44 instances of data falsification in papers spanning 2001 to 2023, she said the “most damning” concerns appeared in a 2022 paper by Shah and 32 other authors in Nature Communications, for which Shah was the corresponding author.

Shah is the latest prominent scientist to have his research face scrutiny by Bik, who has emerged as a leading figure among scientists concerned with research integrity.

She contributed to data falsification allegations against four top scientists at the Dana-Farber Cancer Institute — leading to the retraction of six and correction of 31 papers — and independently reviewed research misconduct allegations reported by the Stanford Daily against former Stanford president Marc T. Tessier-Lavigne, which played a part in his resignation last summer.

Bik said that after being notified of the allegations by one of Shah’s former colleagues, she used the AI software ImageTwin and reverse image searches to identify duplicates across papers. Bik said she plans on detailing the specific allegations in a forthcoming blog post.

In interviews, Matthew S. Schrag, an assistant professor of neurology at Vanderbilt University Medical Center, and Mike Rossner, the president of Image Data Integrity — who reviewed the allegations at The Crimson’s request — said they had merit and raised serious concerns about the integrity of the papers in question.

Shah did not respond to a request for comment for this article.

Blasphemy!! How dare you impugn the integrity of the institution known as EXCEL!

:headshake2:

1 Like

I need to copy over some of the Autofill results from the Excel meme thread (at least I think it’s in there)

The thing is, Excel is embedding AI and other stuff in there now, and people will be doing this crap and there will be nothing there but numbers.

(putting together another free ASOP 23 video for YouTube…)

1 Like

We are still recovering from the replication crisis caused by people not understanding the limitations of p-values and hypothesis testing.

Now, who knows what kind of magic trust people will place in “AI.”

[gif: Freddie Mercury, “A Kind of Magic”]

1 Like

Reminds me of this…

I don’t think that the Women’s Health Initiative was actually fraud, but there are some serious issues with it. Starting to think the folks who do medical studies need some actuaries to review their data.

This podcast is part of a larger discussion where physicians are trying to undo the damage it’s done.

Well, I’m not necessarily wanting the standard “Science WTFery,” which is just poorly done science – I know which one you’re talking about. I don’t think anybody was doing what I call science fraud there – they just over-interpreted the results of the particular study/studies in that case. I remember that one.

What I’m trying to pull into this one (and I’m going to copy some of the older pieces over from other threads) are what look like deliberate frauds: faking up data, plagiarizing the work, just outright deception in the research. Not simply sloppy research or people just doing it incorrectly.

The ones I’m really interested in aren’t from the obviously bullshit “sciences” like psychology/sociology (and I’m not going to apologize for that one). Those fields have so much shoddy work as their “foundations” that there’s a reason they have poor reputations to begin with.

I am more interested in frauds going on in medicine, chemistry, physics, engineering, etc. That is some scary shit in many ways.

1 Like

I’m going to collect some of the older posts from other threads here.

So, let me collect the Ariely/Gino data fraud posts – detected by the Data Colada guys.

I’m just going to link to the original posts in situ:

Research Fraud - data fakery - David Sholto

1 Like

On the “using Excel’s autofill for data imputation” paper

A paper on green innovation that drew sharp rebuke for using questionable and undisclosed methods to replace missing data will be retracted, its publisher told Retraction Watch.

Previous work by one of the authors, a professor of economics in Sweden, is also facing scrutiny, according to another publisher.

As we reported earlier this month, Almas Heshmati of Jönköping University mended a dataset full of gaps by liberally applying Excel’s autofill function and copying data between countries – operations other experts described as “horrendous” and “beyond concern.”

For example, a student in my introductory statistics class once surveyed 54 classmates and was disappointed that the P-value was 0.114. This student’s creative solution was to multiply the original data by three by assuming each survey response had been given by three people instead of one: “I assumed I originally picked a perfect random sample, and that if I were to poll 3 times as many people, my data would be greater in magnitude, but still distributed in the same way.” This ingenious solution reduced the P-value to 0.011, well below Fisher’s magic threshold.

Ingenious, yes. Sensible, no. If this procedure were legitimate, every researcher could multiply their data by whatever number is necessary to get a P-value below 0.05. The only valid way to get more data is, well, to get more data. This student should have surveyed more people instead of fabricating data.
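
Just to make the mechanics concrete, here's a tiny sketch with invented counts (not the student's actual responses, so the p-values won't match the 0.114 / 0.011 in the quote; the mechanism is the same): duplicating every response shrinks the p-value without adding any information.

```python
# Hypothetical numbers: suppose 33 of 54 classmates answered "yes", tested
# against a 50/50 null. Tripling the counts mimics the student's trick.
from scipy.stats import binomtest

print(binomtest(33, 54, p=0.5).pvalue)          # not significant
print(binomtest(3 * 33, 3 * 54, p=0.5).pvalue)  # same proportion, now "significant"
# No new information was collected; the only honest fix is more respondents.
```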

I was reminded of this student’s clever ploy when Frederik Joelving, a journalist with Retraction Watch, recently contacted me about a published paper written by two prominent economists, Almas Heshmati and Mike Tsionas, on green innovations in 27 countries during the years 1990 through 2018. Joelving had been contacted by a PhD student who had been working with the same data used by Heshmati and Tsionas. The student knew the data in the article had large gaps and was “dumbstruck” by the paper’s assertion these data came from a “balanced panel.” Panel data are cross-sectional data for, say, individuals, businesses, or countries at different points in time. A “balanced panel” has complete cross-section data at every point in time; an unbalanced panel has missing observations. This student knew firsthand there were lots of missing observations in these data.

That statement is striking for two reasons. First, far from being a “few” missing values, nearly 2,000 observations for the 19 variables that appear in their paper are missing (13% of the data set). Second, the flexibility of using two, three, or four adjacent values is concerning. Joelving played around with Excel’s autofill function and found that changing the number of adjacent units had a large effect on the estimates of missing values.
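
Here's a toy illustration of that sensitivity (made-up numbers, and assuming the fill behaves like a simple linear trend): the estimate for a missing cell swings wildly depending on how many preceding cells you select before dragging.

```python
# Invented series; assumes autofill extrapolates a linear trend from the
# selected cells. The point is only how unstable the filled value is.
import numpy as np

observed = [9.0, 7.0, 6.5, 2.0]   # last few observed values before a gap

for k in (2, 3, 4):
    anchor = observed[-k:]
    slope, intercept = np.polyfit(range(k), anchor, 1)
    estimate = intercept + slope * k   # one step past the selected cells
    print(f"selecting the last {k} cells -> filled value ~ {estimate:.2f}")
```

With these particular numbers the two-cell fill even goes negative, which under the procedure described earlier would then be overwritten by the last positive value.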

And conversely, you can’t throw away or ignore data. My one stats Prof drilled that into us.

Stupid outliers.

I mean, you can throw it away for certain fittings – but acknowledge you did that.

So, general practice:

A few years ago, there was a proposal by the International Committee of Medical Journal Editors arguing that every paper published in the top journals should make the raw data available. That proposal was shot down because people were worried about their careers, and that other researchers would take their data and use it to make breakthroughs before them. Sharing is the solution. You should have to make all the data available whenever you publish medical research.

It shouldn’t just be a situation re: medical research.

This was in response to indications of data fraud in cancer research (and those papers are getting retracted). Let’s be serious – nobody much cares re: the “squishy studies” papers, because everybody figures those are bullshit to begin with.

But one finds out that data in cancer research is being faked? What?

But back to the “somebody else may make a breakthrough before I do!” worry – maybe they can make it standard that one has to credit the data originators on the papers (but they will know it’s just the data, not my analysis! Yes, that is true). Maybe the data originators get right of first publication, or exclusive use of the data for a certain amount of time in certain journals. Figure something out.

2 Likes

If I don’t get to cure cancer then I guess no one does

The continuing saga of Francesca Gino

https://www.wsj.com/us-news/education/harvard-investigation-francesa-gino-documents-9e334ffe

Harvard Probe Finds Honesty Researcher Engaged in Scientific Misconduct

A newly released document details the university’s investigation into behavioral scientist Francesca Gino, once a star faculty member

article within

By Nidhi Subbaraman

March 14, 2024 7:53 pm ET

A Harvard University probe into prominent researcher Francesca Gino found that her work contained manipulated data and recommended that she be fired, according to a voluminous court filing that offers a rare behind-the-scenes look at research misconduct investigations.

It is a key document at the center of a continuing legal fight involving Gino, a behavioral scientist who in August sued the university and a trio of data bloggers for $25 million.

The case has captivated researchers and the public alike as Gino, known for her research into the reasons people lie and cheat, has defended herself against allegations that her work contains falsified data.

The investigative report had remained secret until this week, when the judge in the case granted Harvard’s request to file the document, with some personal details redacted, as an exhibit.

The investigative committee that produced the nearly 1,300-page document included three Harvard Business School professors tapped by HBS dean Srikant Datar to examine accusations about Gino’s work.

They concluded after a monthslong probe conducted in 2022 and 2023 that Gino “engaged in multiple instances of research misconduct” in the four papers they examined. They recommended that the university audit Gino’s other experimental work, request retractions of three of the papers (the fourth had already been retracted at the time they reviewed it), and place Gino on unpaid leave while taking steps to terminate her employment.

“The Investigation Committee believes that the severity of the research misconduct that Professor Gino has committed calls for appropriately severe institutional action,” the report states.

HBS declined to comment.

The investigative report offers a rare look at the ins and outs of a research misconduct investigation, a process whose documents and conclusions are often kept secret.

Dorothy Bishop, a psychologist at the University of Oxford whose work has drawn attention to research problems in psychology, praised the disclosure. “Along with many other scientists, I have been concerned that institutions are generally very weak at handling investigations of misconduct and they tend to brush things under the carpet,” Bishop said. “It is refreshing to see such full and open reporting in this case.”

Harvard started looking into Gino’s work in October 2021 after a group of behavioral scientists who write about statistical methods on their blog Data Colada complained to the university. They had analyzed four papers co-written by Gino and said data in them appeared falsified.

An initial inquiry conducted by two HBS faculty included an examination of the data sets from Gino’s computers and records, and her written responses to the allegations. The faculty members concluded that a full investigation was warranted, and Datar agreed.

In the course of the full investigation, the two faculty who ran the initial inquiry plus a third HBS faculty member interviewed Gino and witnesses who worked with her or co-wrote the papers. They gathered documents including data files, correspondence and various drafts of the submitted manuscripts. And they commissioned an outside firm to conduct a forensic analysis of the data files.

The committee concluded that in the various studies, Gino edited observations in ways that made the results fit hypotheses.

When asked by the committee about work culture at the lab, several witnesses said they didn’t feel pressured to obtain results. “I never had any indication that she was pressuring people to get results. And she never pressured me to get results,” one witness said.

According to the documents, Gino suggested that most of the problems highlighted in her work could have been the result of honest error, made by herself or research assistants who frequently worked on the data. The investigative committee rejected that explanation because Gino didn’t give evidence that explained “major anomalies and discrepancies.”

Gino also argued that other people might have tampered with her data, possibly with “malicious intent,” but the investigative committee also rejected that possibility. “Although we acknowledge that the theory of a malicious actor might be remotely possible, we do not find it plausible,” the group wrote.

The committee sent their findings to Datar in March 2023.

Harvard placed Gino on administrative leave in June 2023, after the university completed its probe. A few days later, the Data Colada bloggers published their criticisms of the four papers Gino co-wrote in a series of posts that stunned the psychology and behavioral sciences community.

She sued the university in August, arguing that the investigation was flawed and biased.

After the investigation, the journals pulled the remaining papers the investigative committee had recommended be retracted—bringing the total to four retractions.

In court filings and public statements, Gino and her attorney have denied wrongdoing.

“The silver lining is that people can see for themselves that this investigation was a charade,” Andrew Miltenberg, Gino’s attorney, said in an emailed statement. “Harvard found no evidence that Prof. Gino modified data, not a single co-author or research assistant interviewed believed she did it, and their own forensics firm did not claim they proved Prof. Gino’s guilt.”

“I do take integrity seriously,” Gino wrote in a submission to the committee dated Nov. 11, 2022, included in the report. “I have not manipulated nor fabricated data, and I’ve not written papers that intend to mislead readers with the way studies are described.”

The lawsuit has generated a surge of support for the bloggers, with other researchers launching a GoFundMe to support their legal fees. That fund is approaching $400,000.

Harvard and Data Colada will make arguments to dismiss Gino’s claims at a hearing scheduled for April 26.

The Data Colada scientists—Joe Simmons, Leif Nelson and Uri Simonsohn—declined to comment.

Melissa Korn contributed to this article.

It isn’t included in this thread (maybe a different one), but here is a profile of the guys of Data Colada:

https://www.wsj.com/science/data-colada-debunk-stanford-president-research-14664f3

The Band of Debunkers Busting Bad Scientists

Stanford’s president and a high-profile physicist are among those taken down by a growing wave of volunteers who expose faulty or fraudulent research papers

article within

By Nidhi Subbaraman

An award-winning Harvard Business School professor and researcher spent years exploring the reasons people lie and cheat. A trio of behavioral scientists examining a handful of her academic papers concluded her own findings were drawn from falsified data.

It was a routine takedown for the three scientists—Joe Simmons, Leif Nelson and Uri Simonsohn—who have gained academic renown for debunking published studies built on faulty or fraudulent data. They use tips, number crunching and gut instincts to uncover deception. Over the past decade, they have come to their own finding: Numbers don’t lie but people do.

“Once you see the pattern across many different papers, it becomes like a one in quadrillion chance that there’s some benign explanation,” said Simmons, a professor at the Wharton School of the University of Pennsylvania and a member of the trio who report their work on a blog called Data Colada.

Simmons and his two colleagues are among a growing number of scientists in various fields around the world who moonlight as data detectives, sifting through studies published in scholarly journals for evidence of fraud.

At least 5,500 faulty papers were retracted in 2022, compared with 119 in 2002, according to Retraction Watch, a website that keeps a tally. The jump largely reflects the investigative work of the Data Colada scientists and many other academic volunteers, said Dr. Ivan Oransky, the site’s co-founder. Their discoveries have led to embarrassing retractions, upended careers and retaliatory lawsuits.

Neuroscientist Marc Tessier-Lavigne stepped down last month as president of Stanford University, following years of criticism about data in his published studies. Posts on PubPeer, a website where scientists dissect published studies, triggered scrutiny by the Stanford Daily. A university investigation followed, and three studies he co-wrote were retracted.

Stanford concluded that although Tessier-Lavigne didn’t personally engage in research misconduct or know about misconduct by others, he “failed to decisively and forthrightly correct mistakes in the scientific record.” Tessier-Lavigne, who remains on the faculty, declined to comment.

The hunt for misleading studies is more than academic. Flawed social-science research can lead to faulty corporate decisions about consumer behavior or misguided government rules and policies. Errant medical research risks harm to patients. Researchers in all fields can waste years and millions of dollars in grants trying to advance what turn out to be fraudulent findings.

The data detectives hope their work will keep science honest, at a time when the public’s faith in science is ebbing. The pressure to publish papers—which can yield jobs, grants, speaking engagements and seats on corporate advisory boards—pushes researchers to chase unique and interesting findings, sometimes at the expense of truth, according to Simmons and others.

“It drives me crazy that slow, good, careful science—if you do that stuff, if you do science that way, it means you publish less,” Simmons said. “Obviously, if you fake your data, you can get anything to work.”

The journal Nature this month alerted readers to questions raised about an article on the discovery of a room-temperature superconductor—a profound and far-reaching scientific finding, if true. Physicists who examined the work said the data didn’t add up. University of Rochester physicist Ranga Dias, who led the research, didn’t respond to a request for comment but has defended his work. Another paper he co-wrote was retracted in August after an investigation suggested some measurements had been fabricated or falsified. An earlier paper from Dias was retracted last year. The university is looking closely at more of his work.

Experts who examine suspect data in published studies count every retraction or correction of a faulty paper as a victory for scientific integrity and transparency. “If you think about bringing down a wall, you go brick by brick,” said Ben Mol, a physician and researcher at Monash University in Australia. He investigates clinical trials in obstetrics and gynecology. His alerts have prompted journals to retract some 100 papers, with investigations ongoing in about 70 others.

Among those looking into other scientists’ work are Elisabeth Bik, a former microbiologist who specializes in spotting manipulated photographs in molecular biology experiments, and Jennifer Byrne, a cancer researcher at the University of Sydney who helped develop software to screen papers for faulty DNA sequences that would indicate the experiments couldn’t have worked.

“If you take the sleuths out of the equation,” Oransky said, “it’s very difficult to see how most of these retractions would have happened.”

Training by accident

The origins of Data Colada stretch back to Princeton University in 1999. Simmons and Nelson, fellow grad-school students, played in a cover band called Gibson 5000 and a softball team called the Psychoplasmatics. Nelson and Simonsohn got to know each other in 2007, when they were faculty members in the business school at the University of California, San Diego.

The trio became friends and, in 2011, published their first joint paper, “False-Positive Psychology.” It included a satirical experiment that used accepted research methods to demonstrate that people who listened to the Beatles song “When I’m Sixty-Four” grew younger. They wanted to show how research standards could accommodate absurd findings. “They’re kind of legendary for that,” said Yoel Inbar, a psychologist at the University of Toronto Scarborough. The study became the most cited paper in the journal Psychological Science.

When the trio launched Data Colada in 2013, it became a site to air ideas about the benefits and pitfalls of statistical tools and data analyses. “The whole goal was to get a few readers and to not embarrass ourselves,” Simmons said. Over time, he said, “We have accidentally trained ourselves to see fraud.”

They co-wrote an article published in 2014 that coined the now-common academic term “p-hacking,” which describes cherry-picking data or analyses to make insignificant results look statistically credible. Their early work contributed to a shift in research methods, including the practice of sharing data so other scientists can try to replicate published work.
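
For readers who haven't seen it spelled out, here's a minimal sketch of the mechanism (pure noise, invented subgroup labels, nothing from any real study): test enough arbitrary slices of the data and one of them will look “significant.”

```python
# p-hacking in miniature: there is no real effect, but scanning 20 arbitrary
# subgroups and reporting only the best one manufactures a "finding".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
outcome = rng.normal(size=200)            # pure noise, no effect anywhere
groups = np.repeat(np.arange(20), 10)     # 20 made-up subgroups of 10 each

pvals = [stats.ttest_1samp(outcome[groups == g], popmean=0).pvalue
         for g in range(20)]
print(f"smallest of 20 p-values: {min(pvals):.3f}")
# Under the null, roughly 1 - 0.95**20, about 64%, of such scans produce at
# least one p-value below 0.05 even though nothing real is going on.
```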

“The three of them have done an amazing job of developing new methodologies to interrogate the credibility of research,” said Brian Nosek, executive director of the Center for Open Science, a nonprofit based in Charlottesville, Va., which advocates for reliable research.

Nelson, who teaches at the Haas School of Business at the University of California, Berkeley, is described by his partners as the big-picture guy, able to zoom out of the weeds and see the broad perspective.

Simonsohn is the technical whiz, at ease with opaque statistical techniques. “It is nothing short of a superpower,” Nelson said. Simonsohn was the first to learn how to spot the fingerprints of fraud in data sets.

Working together, Simonsohn said, “feels a lot like having a computer with three core processors working in parallel.”

The men first eyeball the data to see if they make sense in the context of the research. The first study Simonsohn examined for faulty data on the blog was obvious. Participants were asked to rate an experience on a scale from zero through 10, yet the data set inexplicably had negative numbers.

Another red flag is an improbable claim—say a study that said a runner could sprint 100 yards in half a second. Such findings always get a second look. “You immediately know, no way,” said Simonsohn, who teaches at the Esade Business School in Barcelona, Spain. Another telltale sign is perfect data in small data sets. Real-world data is chaotic, random.

Any one of those can trigger an examination of a paper’s underlying data. “Is it just an innocent error? Is it p-hacking?” Simmons said. “We never rush to say fraud.”
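
In the same spirit, the very first pass can be as dumb as a range check. A sketch (the column name and the 0-to-10 scale are assumptions, and the data are toy values):

```python
# Toy first-pass screen: flag values outside the instrument's stated 0-10 scale,
# like the impossible negative ratings mentioned above.
import pandas as pd

df = pd.DataFrame({"rating": [7, 9, 3, -2, 10, 5]})   # invented responses
impossible = df[(df["rating"] < 0) | (df["rating"] > 10)]
print(impossible)   # anything printed here warrants a closer look at the raw data
```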

To keep up with their blog and other ventures, the trio text almost daily on a group chat, meet on Zoom about once a week and email constantly.

Simonsohn’s phone pinged in August while he was on vacation with his family in the mountains of Spain. Simmons and Nelson broke the news that they were being sued for defamation in a $25 million lawsuit.

“I was completely dumbfounded and terrified,” Nelson said.

‘She’s usually right’

Bad data goes undetected in academic journals largely because the publications rely on volunteer experts to ensure the quality of published work, not to detect fraud. Journals don’t have the expertise or personnel to examine underlying data for errors or deliberate manipulation, said Holden Thorp, editor in chief of the Science family of journals.

Thorp said he talks to Bik and other debunkers, noting that universities and other journal editors should do the same. “Nobody loves to hear from her,” he said. “But she’s usually right.”

The data sleuths have pushed journals to pay more attention to correcting the record, he said. Most have hired people to review allegations of bad data. Springer Nature, which publishes Nature and some 3,000 other journals, has a team of 20 research staffers, said Chris Graf, the company’s research integrity director, twice as many as when he took over in 2021.

Retraction Watch, which with research organization Crossref keeps a log of some 50,000 papers discredited over the past century, estimated that, as of 2022, about eight papers have been retracted for every 10,000 published studies.

Bik and others said it can take months or years for journals to resolve complaints about suspect studies. Of nearly 800 papers that Bik reported to 40 journals in 2014 and 2015 for running misleading images, only a third had been corrected or retracted five years later, she said.

The work isn’t without risk. French infectious-disease specialist Didier Raoult threatened to sue Bik after she flagged alleged errors in dozens of papers he co-wrote, including one touting the benefits of hydroxychloroquine to treat Covid-19. Raoult said he stood by his research.

Honest work

Simonsohn got a tip in 2021 about the data used in papers published by Harvard Business School professor Francesca Gino. Her well-regarded studies explored moral questions: Why do some people lie? What reward drives others to cheat (see “The Cheater’s High: The Unexpected Affective Benefits of Unethical Behavior”)? What factors influence moral behavior?

The three scientists examined data underlying four studies and identified what they said were irregularities in how some entries appeared. Numbers in data sets looked to have been manually changed. In December 2021, they sent their discoveries to Harvard, which conducted its own investigation.

Harvard concluded Gino was “responsible for ‘research misconduct,’” according to her lawsuit against Harvard, Nelson, Simmons and Simonsohn. The Harvard Business School asked journals that published the four papers to retract them, saying her results were invalid.

In June this year, the trio posted their conclusions about Gino’s studies on Data Colada. Data in four papers, they said, had been falsified. When they restored what they hypothesized was correct information in one of the four studies, the results didn’t support the study’s findings. The posts sent the social sciences community into an uproar.

Gino is on administrative leave, and the school has begun the process of revoking her tenure. In her lawsuit, Gino said Harvard’s investigation was flawed as well as biased against her because of her gender. A business school spokesman declined to comment. The suit also contends that the Data Colada blog posts falsely accused her of fraud. The three scientists said they stood by their posted findings.

Gino, through her lawyer, denied wrongdoing. She is seeking at least $25 million in damages. “We vehemently disagree with any suggestion of the use of the word fraud,” said Gino’s lawyer Andrew Miltenberg. Gino declined to comment.

Miltenberg said Gino was working on a rebuttal to Data Colada’s conclusions.

In August, a group of 13 scientists organized a fundraiser that in a month collected more than $300,000 to help defray Data Colada’s legal costs.

“These people are sending a very costly signal,” Simmons said. “They’re paying literal dollars to be, like, ‘Yeah, scientific criticism is important.’”

Seems like there’s plenty of fraud in research, peer review, and journal publication.

Institutions appear to be badly broken.

I suppose when there’s so much riding on a professor’s ability to get published / noticed, it’s inevitable. But unfortunate.