Across colleges and universities, faculty are increasingly harnessing generative AI tools, most prominently large language models (LLMs) such as OpenAI’s GPT-4 (the model behind ChatGPT), to accelerate key phases of scholarship: literature reviews, data handling, hypothesis generation, and even draft writing.
A recent study by Mehrnaz Mostafapour, PhD, et al., titled “Evaluating Literature Reviews Conducted by Humans Versus ChatGPT: Comparative Study”, published in JMIR AI (2024), compares a GPT-4-led literature review with one conducted by human researchers. The findings offer a useful entry point into how these tools are reshaping research norms and underscore the urgent need for new best-practice frameworks around authorship, credit, error mitigation, and peer review.
In the study, the authors used iterative prompt engineering to have GPT-4 conduct a literature review on relational dynamics between physicians and patients in medicolegal contexts. The resulting review was then compared with a human-led review that employed a systematic search in Ovid MEDLINE and thematic analysis.
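To make the prompt-engineering workflow concrete, here is a minimal sketch of what an iterative prompting loop could look like in code. It is illustrative only and not drawn from the study: the prompts, model name, refinement question, and number of rounds are all assumptions, built on the OpenAI Python client’s chat-completions interface.

```python
# Illustrative sketch of an iterative prompt-engineering loop (not the study's actual prompts).
# Assumes the OpenAI Python client is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a research assistant summarizing published literature."},
    {"role": "user", "content": (
        "List relational factors between physicians and patients discussed in "
        "medicolegal research, with representative citations."
    )},
]

for draft_round in range(3):  # a few refinement rounds; a human reviews each draft
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    draft = response.choices[0].message.content
    print(f"--- Draft {draft_round + 1} ---\n{draft}\n")

    # After human review, append a refined follow-up prompt and continue.
    messages.append({"role": "assistant", "content": draft})
    messages.append({"role": "user", "content": (
        "Add relational factors not yet covered and flag any citations you are unsure about."
    )})
```

The design point worth noting, consistent with the study’s emphasis on human oversight, is that each round’s output is reviewed by a person before the next prompt is issued.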
The results showed that GPT-4 generated a list of 21 relational factors within seconds, covering a notably wide breadth of content. The human review, while much slower, provided deeper contextualization, greater accuracy, and clearer methodological transparency. For instance, the researchers judged about 14% of GPT-4’s listed factors and 7.5% of its references to be irrelevant. The authors concluded that GPT-4 may serve well as a preliminary research assistant but cannot replace expert human scholarship.
For faculty, the study signals opportunity and caution in equal measure. Generative AI can help produce an initial draft of a literature survey, map a field rapidly, and surface connections that might otherwise take weeks to identify. It may free up time for interpretation, synthesis, and writing. Yet it also raises profound questions about how research is done, who deserves credit, how errors propagate, and how peer review must adapt.
Attribution, Credit, and Governance
If a faculty member uses GPT-4 extensively for literature retrieval, drafting text, or generating hypotheses, what is the proper attribution? Traditional norms assume a human author—or author team—drives all substantive intellectual contributions. When AI contributes significantly, should it be named as a “co-author” or simply acknowledged as a tool?
The Mostafapour study treats the model as an assistant and emphasizes human oversight:
“We suggest approaching GPT-4 as a research assistant who possesses limited contextual expertise and occasionally synthesizes responses entirely to overcome the second challenge.”
Still, many journals and institutions are only beginning to define governance policies. If the human researcher shapes nearly every prompt, selects outputs, verifies references, and corrects errors, the researcher remains the author. Yet the efficiency gains and much of the initial content come from the AI, creating a new and complex division of labor and contribution.
Best-Practice Recommendations
- Full disclosure of the use of generative AI tools in literature reviews, data work, or drafting.
- Clear delineation of what the human researcher completed (prompt design, verification, interpretation, editing) versus what the model contributed (retrieval, initial synthesis).
- Authorship assigned only to humans who take final intellectual responsibility. AI tools should not receive authorship unless their design or code forms part of the research contribution.
Error Rates, “Hallucinations,” and Peer Review Implications
GPT-4’s speed comes at a cost. The model frequently produces irrelevant results, along with occasional hallucinations: information that is factually incorrect, fabricated, or misleading. The authors caution that GPT-4 “will not communicate … when the topic has been saturated or knowing when to stop asking for more information.”
From a faculty perspective, this introduces a new kind of risk: errors hidden beneath polished prose. If an AI-assisted draft enters peer review without rigorous human auditing, reviewers may find it harder to trace sources, check veracity, or catch subtle interpretive mistakes. Opacity compounds the challenge. Human researchers document full search strategies, inclusion and exclusion criteria, and screening flows, steps an LLM does not replicate. The study notes that there is currently significantly less transparency in how LLMs process prompts, collect information, and generate outputs.
To maintain rigor:
- Treat AI outputs as draft material, not final text.
- Use human experts to verify every reference, assess relevance, and review interpretive depth (one way to automate a first pass of reference checking is sketched after this list).
- Report prompt structures, AI versions used, and human oversight steps in the methods section.
- Encourage journals and peer reviewers to ask: “Was AI involved? What steps did the human author take to verify outputs?”
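Part of that verification can be automated before expert review. The sketch below is one possible first-pass screen, not a prescribed method: it assumes each AI-supplied reference comes with a DOI, looks the DOI up in the public Crossref REST API, and flags entries that do not resolve or whose registered titles differ. The example references are placeholders.

```python
# First-pass screen of AI-supplied references (placeholder data), assuming each has a DOI.
# Unresolvable DOIs or title mismatches are flagged for human follow-up.
import requests

references = [
    {"doi": "10.1234/example-doi", "title": "Example article title"},             # hypothetical
    {"doi": "10.5678/possibly-fabricated", "title": "Possibly fabricated work"},   # hypothetical
]

for ref in references:
    resp = requests.get(f"https://api.crossref.org/works/{ref['doi']}", timeout=10)
    if resp.status_code != 200:
        print(f"FLAG: DOI not found in Crossref: {ref['doi']}")
        continue
    registered = (resp.json()["message"].get("title") or [""])[0]
    if ref["title"].lower() not in registered.lower():
        print(f"CHECK: title mismatch for {ref['doi']} (registered: {registered!r})")
    else:
        print(f"OK: {ref['doi']} resolves and matches its registered title")
```

A script like this only confirms that a citation exists and roughly matches its record; whether the source actually supports the claim it is attached to still requires a human reader.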
Shifts for Peer Review, Hiring, and Tenure
Beyond authorship, generative AI is transforming the peer-review ecosystem itself. If faculty can generate a first-pass literature review in hours rather than weeks, there may be pressure to accelerate submissions. That shift raises questions about reviewer capacity, expectations, and the overall quality of published work.
Moreover, as AI becomes integrated into scholarly writing, universities and departments may need to revisit hiring, promotion, and tenure guidelines to clarify when AI-assisted work is acceptable and how human oversight should be weighed. The Mostafapour study suggests a promising hybrid: “The structured approach to prompt engineering may serve as a model for researchers looking to integrate generative AI into their literature searches.” That hybrid model could serve as a blueprint: AI for speed and breadth, human expertise for nuance and depth.
Using a Best-Practice Framework
Drawing from the study, four elements form the initial foundation of a best-practice framework for AI use in academic research:
- Transparency: Declare the model used (version, training-data cutoff, and prompt-engineering steps) and describe how human oversight occurred.
- Verification: Human authors must verify all AI-generated content, including references, facts, and interpretations. Treat AI output as an early draft.
- Authorship clarity: AI is a tool, not an author, unless its creation is central to the research itself. Human authors define and claim their own intellectual contributions.
- Peer review readiness: Manuscripts using AI should include a methods statement on its use, allowing reviewers to assess reliability and verification processes.
The study suggests faculty can responsibly use generative AI in literature reviews—but only with human expertise, oversight, and transparent reporting. Neglecting those safeguards risks errors, misattribution, and erosion of trust in scholarly work.
Generative AI is rapidly embedding itself into faculty workflows, enabling faster reviews, surfacing wider literatures, and supporting hypothesis generation. But it does not replace the human capacity for judgment, interpretation, and accountability. As the technology evolves, research offices, journals, and faculty governance committees must establish clear guidelines around credit, accuracy, and oversight.
The Mostafapour comparative study captures both the promise and the precautions. The takeaway is simple but vital: Treat AI as an accelerator, not a substitute.
Faculty who adopt transparent, verified research practices will be best positioned to use generative AI responsibly—advancing and accelerating scholarship while safeguarding the integrity of authorship, credit, and peer review.