AutoAdminsteredAntidotes: Circuit detection in a poisoned model for MNIST classification
A downloadable project
We trained a simple convolutional neural network (CNN) on a poisoned version of the MNIST dataset. Some images in the dataset carry a watermark, and the labels of those images have been modified. We describe how we uncover the path the watermark takes through the network using ablation, and how we visualize the poisoning using feature maximization methods. We also discuss applications to safety and possible generalizations.
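As a rough illustration of what a poisoned dataset of this kind could look like, here is a minimal sketch in Python with NumPy. It is not the project's actual code: the watermark shape (a 3x3 bright patch in the top-left corner), the poisoning fraction, the target label, and the helper names `add_watermark` and `poison_dataset` are all assumptions made for the example, and random arrays stand in for the MNIST images.

```python
import numpy as np


def add_watermark(image, size=3, value=1.0):
    """Return a copy of `image` with a size x size corner patch set to `value`.

    The patch shape and position are assumptions for this sketch.
    """
    marked = image.copy()
    marked[:size, :size] = value
    return marked


def poison_dataset(images, labels, fraction=0.1, target_label=0, seed=0):
    """Watermark a random fraction of the images and overwrite their labels.

    Returns the poisoned images, the poisoned labels, and a boolean mask
    marking which samples were poisoned.
    """
    rng = np.random.default_rng(seed)
    n = len(images)
    n_poison = int(round(fraction * n))
    idx = rng.choice(n, size=n_poison, replace=False)

    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    for i in idx:
        poisoned_images[i] = add_watermark(images[i])
    poisoned_labels[idx] = target_label  # hypothetical attacker-chosen label

    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return poisoned_images, poisoned_labels, mask


# Stand-in data with MNIST-like shape (28x28 grayscale); not real MNIST.
data_rng = np.random.default_rng(42)
images = (data_rng.random((100, 28, 28)) * 0.5).astype(np.float32)
labels = data_rng.integers(0, 10, size=100)

p_images, p_labels, mask = poison_dataset(images, labels, fraction=0.1, target_label=0)
print("poisoned samples:", int(mask.sum()))
```

Keeping the boolean mask makes it easy to compare network behavior on watermarked and clean inputs later, for example when ablating units to see which ones respond to the watermark.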
Download
Write up.pdf 1.3 MB