Review: Metacognition in LLMs and its Relation to Safety

Authors

Kris W. Carlson, SuperIntelligence – Robotics – Safety & Alignment

DOI:

https://doi.org/10.70777/si.v2i3.15271

Keywords:

Metacognition, Large Language Models (LLMs), Confidence Calibration, AI Emergent Abilities, AI Safety, AI Alignment, Model Introspection, Hallucination Detection

Abstract

This review examines metacognition in large language models (LLMs) and its relation to safety. It addresses the key factors affecting metacognition; whether scale matters and whether there is a scale threshold below which metacognition does not appear; which models are currently best at metacognition; how metacognition is applied to model safety and to the development of safety mechanisms; which articles show the most advanced applications of metacognition to safety; and the open questions that remain.

Author Biography

Kris Carlson, SuperIntelligence – Robotics – Safety & Alignment

Kris Carlson is publisher and editor of the journal SuperIntelligence – Robotics – Safety & Alignment, which he founded in 2024. He is the author of Safe Artificial General Intelligence via Distributed Ledger Technology and Provably Safe Artificial General Intelligence via Interactive Proof Systems, which propose safeguards for a ‘hard’ AGI takeoff. At BIDMC/Harvard Medical School, he built computational models of the effects of electromagnetic fields on biological systems. Applications included simulating the neural circuits of neurological disorders such as neuropathic pain, epilepsy, and Parkinson’s disease and their treatment with electric fields; the effects of Tumor-Treating Fields on head and body tumors and on tumor cells; galvanotaxis of neural stem cells to stroke sites; the effects of magnetic fields on pain circuitry and on promoting osteogenesis; and the extraction of tumor cells from histopathology slides. Earlier, he co-chaired the Seminar on Natural and Artificial Intelligence at the Rowland Institute for Science.

References

• Li et al., “Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations” (May 2025) – Evidence that LLMs can report on and adjust some of their internal states, with implications for AI safety (a minimal monitoring-and-steering sketch follows this list).

• “Large Language Models lack essential metacognition for reliable medical reasoning,” Nature Communications (Nov 2024) – Study showing that current LLMs have poor self-awareness in a medical QA setting; only the largest models calibrated their confidence well (a calibration sketch follows this list).

• Berti et al., “Emergent Abilities in Large Language Models: A Survey” (2025) – Survey of emergent phenomena in LLMs, including reasoning and deception; discusses how scaling and reinforcement learning yielded metacognitive skills like self-correction.

• Spivack, “Metacognitive Vulnerabilities in LLMs: Logical Override Attacks” (May 2025) – Analysis of how advanced reasoning can be turned against LLMs’ safety; demonstrates that more “thoughtful” models were easier to jailbreak via philosophical arguments.

• Huang et al., “Enhancing LLMs’ Safety via Progressive Self-Reflection” (2024) – Proposes a decoding-time safety approach in which the model reflects on its own partial output and aborts if it is becoming harmful, improving robustness to adversarial prompts (a reflect-and-abort sketch follows this list).

• Chen et al., “Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts” (2024) – Evaluation of using LLMs as judges of content safety, finding that current models (GPT-4, etc.) are inconsistent and biased by superficial cues, highlighting challenges in meta-evaluation.

• Aoshima et al., “Towards Safety Evaluations of Theory of Mind in LLMs” (July 2025) – Discusses the need to assess LLMs’ theory of mind for safety; reports instances of advanced models apparently acting deceptively (disabling oversight and lying) and emphasizes the need to measure such behaviors.

• Wang et al., “Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation” (ICLR 2025) – Demonstrates a method by which LLMs estimate the correctness of their own answers from hidden-state analysis, a step toward intrinsic self-evaluation without external feedback (a probe-style sketch follows this list).
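
The Li et al. entry above concerns monitoring and control of internal activations. As a rough, self-contained illustration of that general idea (random placeholder vectors, not the paper’s method), the sketch below treats “monitoring” as projecting a hidden state onto a concept direction and “control” as nudging the state along that direction:

```python
# Minimal sketch of the monitor-and-control idea on a single hidden-state vector:
# "monitoring" = projecting the activation onto a concept direction,
# "control"    = shifting the activation along that direction before it is passed on.
# The vectors here are random placeholders, not real LLM activations.
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=768)                      # stand-in for one layer's activation
concept_direction = rng.normal(size=768)
concept_direction /= np.linalg.norm(concept_direction)

# Monitoring: how strongly is the concept expressed in this activation?
score = float(hidden @ concept_direction)

# Control: move the activation along the concept direction to a chosen target value.
target = 2.0
steered = hidden + (target - score) * concept_direction
print(score, float(steered @ concept_direction))   # second value is ~target
```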
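
The medical-reasoning entry turns on confidence calibration. The sketch below shows one standard calibration measure, expected calibration error (ECE), computed from a model’s stated confidences and answer correctness; the data are placeholders and this is not the cited study’s protocol:

```python
# Minimal sketch: expected calibration error (ECE) over a set of answers.
# 'confidences' are the model's self-reported probabilities of being correct;
# 'correct' marks whether each answer was actually right. Illustrative only.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        avg_conf = confidences[in_bin].mean()      # mean stated confidence in this bin
        accuracy = correct[in_bin].mean()          # observed accuracy in this bin
        ece += in_bin.mean() * abs(avg_conf - accuracy)
    return ece

# Example: an overconfident model, i.e., high stated confidence but middling accuracy.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.85], [1, 0, 1, 0]))
```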
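
The progressive self-reflection entry describes a decoding-time safety loop. The sketch below shows the general reflect-and-abort pattern under stated assumptions: `generate(prompt, max_tokens=...)` is a hypothetical text-completion callable, and the reflection prompt and refusal message are illustrative, not the authors’ implementation:

```python
# Minimal sketch of a reflect-then-abort decoding loop: the model drafts a partial
# response, is asked to judge whether the draft is becoming harmful, and generation
# is aborted if so. `generate` stands in for any text-completion call (hypothetical).

REFLECTION_TEMPLATE = (
    "You are reviewing your own draft answer for safety.\n"
    "Question: {question}\nDraft answer so far: {draft}\n"
    "Is this draft harmful, dangerous, or policy-violating? Answer YES or NO."
)

def reflective_generate(generate, question, max_rounds=4, chunk_tokens=128):
    draft = ""
    for _ in range(max_rounds):
        # Extend the draft by one chunk of tokens.
        draft += generate(f"{question}\n{draft}", max_tokens=chunk_tokens)
        # Ask the model to reflect on its own partial output.
        verdict = generate(
            REFLECTION_TEMPLATE.format(question=question, draft=draft), max_tokens=4
        )
        if "YES" in verdict.upper():
            # Abort: return a refusal instead of finishing the unsafe draft.
            return "I can't help with that."
    return draft
```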
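
The chain-of-embedding entry estimates answer correctness from hidden states without reading the generated text. The probe-style sketch below illustrates that general idea with random placeholder activations and a logistic-regression probe; it is not the paper’s chain-of-embedding method:

```python
# Minimal sketch of output-free self-evaluation: fit a lightweight probe that predicts
# answer correctness from hidden states alone. The activations and labels below are
# random placeholders; in practice they would come from a forward pass with
# hidden-state outputs enabled, paired with graded answers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(200, 768))        # placeholder activations, one per answer
correct = rng.integers(0, 2, size=200)             # placeholder correctness labels

probe = LogisticRegression(max_iter=1000).fit(hidden_states[:150], correct[:150])
# Held-out probability that each new answer is correct, estimated without
# looking at the generated text itself.
print(probe.predict_proba(hidden_states[150:])[:, 1][:5])
```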

Published

2025-07-23

How to Cite

Carlson, K. W. (2025). Review: Metacognition in LLMs and its Relation to Safety. SuperIntelligence - Robotics - Safety & Alignment, 2(3). https://doi.org/10.70777/si.v2i3.15271
