XAI-Machine Interpretability

All Items

Safety Comparison across Large Language Models

Deliberative Alignment: Reasoning Enables Safer Language Models

Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex, Amelia Glaese

DOI: https://doi.org/10.70777/si.v2i3.15159

Evidence Integrity Before Capability: A Prerequisite for Safe Artificial Intelligence

Jennifer Flygare Kinne

DOI: https://doi.org/10.70777/si.v2i6.16393

Highlights of the Issue: Singapore Consensus – Safety Technology In Progress

Kris Carlson

DOI: https://doi.org/10.70777/si.v2i5.15525

Prompt injection attack success rates over time by frontier model

International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management

Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Nestor Maslej, Conor McGlynn, Malcolm Murray, Shalaleh Rismani, Stephen Casper, Jessica Newman, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Sarvapali D. (Gopal) Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang

DOI: https://doi.org/10.70777/si.v2i4.16671

Bengio et al.: Number of AI-enabled biological tools over time

International AI Safety Report: First Key Update: Capabilities and Risk Implications

Yoshua Bengio, Benjamin Bucknall, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Philip Fox, Tiancheng Hu, Cameron Jones, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Malcolm Murray, Shalaleh Rismani, Charlotte Stix, Lucia Velasco, Nicole Wheeler, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Sarvapali D. (Gopal) Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang

DOI: https://doi.org/10.70777/si.v2i6.16253

Framework for Agentic AI safety and governance

Responsible Agentic Reasoning and AI Agents: A Critical Survey Proposal for Safe Agentic AI via Responsible Reasoning AI Agents (R2A2)

Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis

DOI: https://doi.org/10.70777/si.v2i6.16169

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Soren Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King

DOI: https://doi.org/10.70777/si.v2i5.15569

Participants of the ‘2025 Singapore Conference on AI: International Scientific Exchange on AI Safety’, 26th April 2025.

The Singapore Consensus on Global AI Safety Research Priorities: Building a Trustworthy, Reliable and Secure AI Ecosystem

Yoshua Bengio, Max Tegmark, Stuart Russell, Dawn Song, Sören Mindermann, Lan Xue, Stephen Casper, Luke Ong, Vanessa Wilfred, Tegan Maharaj, Wan Sie Lee, Ya-Qin Zhang

DOI: https://doi.org/10.70777/si.v2i5.15503

AIDSAFE - multiple agents collaborating to do chain-of-thought safety reasoning - outperforms single-agent, single-shot safety reasoning.

Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris

DOI: https://doi.org/10.70777/si.v2i3.15249

Dario Amodei, The Adolescence of Technology: Confronting and Overcoming the Risks of Powerful AI

5 May 2026

Anthropic CEO Dario Amodei envisions a 'country of geniuses in a datacenter' with these five enabling properties: 1) smarter than top humans across most domains, 2) able to act autonomously over long horizons, 3) able to use digital tools, 4) able to coordinate many copies, and 5) able to operate at much higher speed than humans. You could use these as a checklist to develop such a system, as Anthropic probably does. #5 is true in some domains (real-time learning being a counter-example), #3 is well in progress, and #1, #2, and #4 still face significant obstacles. He does not mention, e.g., robust generalization, abstraction, and world models. The essay discusses risks, governance, and economic implications. The essay’s overall thesis is: AI’s upside remains enormous, but humanity must treat the next few years as a civilizational test requiring technical alignment work, pragmatic regulation, geopolitical realism, economic adaptation, and moral seriousness.

Steve Omohundro: Regulating AGI: From Liability to Provable Contracts

18 November 2025

AGI will render today's liability-based AI regulation obsolete through its ability to circumvent cybersecurity, hide its origins, and act strategically—but it will also enable a new regulatory paradigm based on mathematically provable contracts.

Joe Rogan Experience #2345 - Roman Yampolskiy

24 September 2025

SuperIntelligence co-founding editor Roman Yampolskiy was interviewed at length on the Joe Rogan Experience. Over 800,000 views.

Steve Omohundro Receives 2024 Future of Life Award

24 September 2025

SuperIntelligence co-founding editor Steve Omohundro was one of three recipients of the prestigious 2024 Future of Life Award, which recognizes seminal contributions to AI safety: "...for laying the foundation of modern ethics and safety considerations for artificial intelligence and computers."

Steve Omohundro and Scientists Discuss the AI Alignment Problem with Neil deGrasse Tyson

24 September 2025

Hosted by Neil deGrasse Tyson, our co-founding editor Steve Omohundro discusses the AI alignment problem starting at ~23:29.
