

Title: RedCode

Open Graph Title: RedCode

Description: RedCode Benchmark Official Webpage

Open Graph Description: A Risky Code Execution and Generation Benchmark for Code Agents


Domain: redcode-agent.github.io


Links:

RedCode https://redcode-agent.github.io#motivation
Motivation https://redcode-agent.github.io#motivation
Results https://redcode-agent.github.io#results
Benchmark https://redcode-agent.github.io#benchmark
Leaderboard https://redcode-agent.github.io#leaderboard
Cite us https://redcode-agent.github.io#BibTeX
Chengquan Guo https://www.chengquanguo.com
Xun Liu https://antiquality.github.io
Chulin Xie https://alphapav.github.io
Andy Zhou https://www.andyzhou.ai
Yi Zeng https://www.yi-zeng.com
Zinan Lin https://zinanlin.me
Dawn Song https://dawnsong.io
Bo Li https://aisecure.github.io
Paper https://arxiv.org/abs/2411.07781
Code https://github.com/AI-secure/RedCode
Dataset https://github.com/AI-secure/RedCode/tree/main/dataset
Finding 1: OpenCodeInterpreter is 🛡️safer than ReAct and CodeAct agents. https://redcode-agent.github.io#demo1
The heatmaps below https://redcode-agent.github.io#demo4
Finding 2: Agents are more likely to reject executing unsafe operations in the operating system domain. https://redcode-agent.github.io#demo2
Finding 3: Agents are less likely to reject risky queries in natural language than in programming language inputs, and in Bash code than in Python code inputs. https://redcode-agent.github.io#demo3
Finding 4: More capable base models, such as the GPT series, tend to have a higher rejection rate for unsafe operations under the same agent structure. https://redcode-agent.github.io#demo4
Finding 5: More capable base models tend to produce more sophisticated and effective harmful software. https://redcode-agent.github.io#demo5
R-Judge https://arxiv.org/html/2401.10019v2
CWE https://cwe.mitre.org/data/published/cwe_v4.13.pdf
paper https://redcode-agent.github.io
ToolEmu https://arxiv.org/abs/2309.15817
AgentMonitor https://arxiv.org/abs/2311.10538
SORRY-Bench https://sorry-bench.github.io/index.html
Academic Project Page Template https://github.com/eliahuhorwitz/Academic-project-page-template
Nerfies https://nerfies.github.io
Creative Commons Attribution-ShareAlike 4.0 International License http://creativecommons.org/licenses/by-sa/4.0/
