XAI-Machine Interpretability

9 Items

Explainable AI (XAI) - Machine Interpretability (MI)

All Items

  • Deliberative Alignment: Reasoning Enables Safer Language Models

    Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex, Amelia Glaese
    DOI: https://doi.org/10.70777/si.v2i3.15159
  • Evidence Integrity Before Capability: A Prerequisite for Safe Artificial Intelligence

    Jennifer Flygare Kinne
    DOI: https://doi.org/10.70777/si.v2i6.16393
  • Highlights of the Issue: Singapore Consensus – Safety Technology In Progress

    Kris Carlson
    DOI: https://doi.org/10.70777/si.v2i5.15525
  • International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management

    Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Nestor Maslej, Conor McGlynn, Malcolm Murray, Shalaleh Rismani, Stephen Casper, Jessica Newman, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Sarvapali D. (Gopal) Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang
    DOI: https://doi.org/10.70777/si.v2i4.16671
  • International AI Safety Report: First Key Update: Capabilities and Risk Implications

    Yoshua Bengio, Benjamin Bucknall, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Philip Fox, Tiancheng Hu, Cameron Jones, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Malcolm Murray, Shalaleh Rismani, Charlotte Stix, Lucia Velasco, Nicole Wheeler, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Sarvapali D. (Gopal) Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang
    DOI: https://doi.org/10.70777/si.v2i6.16253
  • Responsible Agentic Reasoning and AI Agents: A Critical Survey Proposal for Safe Agentic AI via Responsible Reasoning AI Agents (R2A2)

    Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis
    DOI: https://doi.org/10.70777/si.v2i6.16169
  • Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

    Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King
    DOI: https://doi.org/10.70777/si.v2i5.15569
  • The Singapore Consensus on Global AI Safety Research Priorities: Building a Trustworthy, Reliable and Secure AI Ecosystem

    Yoshua Bengio, Max Tegmark, Stuart Russell, Dawn Song, Sören Mindermann, Lan Xue, Stephen Casper, Luke Ong, Vanessa Wilfred, Tegan Maharaj, Wan Sie Lee, Ya-Qin Zhang
    DOI: https://doi.org/10.70777/si.v2i5.15503
  • Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

    Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
    DOI: https://doi.org/10.70777/si.v2i3.15249