Can Large Language Models Explain Their Internal Mechanisms?

This approach shows that, by the upper purple circle, the information about the United Kingdom has already been fetched by the model.

If you replace x with a residual value taken from some text and the model completes the prompt correctly, you can guess that the information has been fetched. There is a question this approach cannot answer, though. Assume the source prompt is "United Kingdom" and the investigation prompt is "Iran: Tehran France: Paris x". We have no way of knowing whether all the information about the United Kingdom has already been fetched into the residual, or whether only the United Kingdom itself has been fetched.
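The ambiguity can be made concrete with a toy sketch. Nothing here is a real model: residuals are plain dicts of features, the layers `lookup_layer` and `identity_layer` are hypothetical, and the few-shot context of the investigation prompt is reduced to a readout that asks for the capital. Under those assumptions, two toy models that fetch the capital at different depths give the same completion after patching, even though the patched residual differs.

```python
# Toy illustration only: residuals are dicts of features, not real activations.
CAPITALS = {"United Kingdom": "London", "Iran": "Tehran", "France": "Paris"}


def embed(token):
    """Initial residual: only the identity of the entity is present."""
    return {"entity": token}


def lookup_layer(residual):
    """Hypothetical layer that fetches the capital into the residual."""
    out = dict(residual)
    if "entity" in out and "capital" not in out:
        out["capital"] = CAPITALS[out["entity"]]
    return out


def identity_layer(residual):
    """Hypothetical layer that leaves the residual unchanged."""
    return dict(residual)


def run(layers, residual):
    """Apply the given layers in order to a residual."""
    for layer in layers:
        residual = layer(residual)
    return residual


def readout(residual):
    """Stand-in for the 'Iran: Tehran France: Paris x' prompt: ask for the capital."""
    return residual.get("capital")


def patch_and_complete(layers, source_token, patch_layer):
    # Source run: residual of the source prompt after `patch_layer` layers.
    source_residual = run(layers[:patch_layer], embed(source_token))
    # Target run: x is replaced by that residual, and the remaining layers run on it.
    final_residual = run(layers[patch_layer:], source_residual)
    return source_residual, readout(final_residual)


# Two toy models: one fetches the capital early, the other only after the patch point.
early_fetch = [lookup_layer, identity_layer, identity_layer]
late_fetch = [identity_layer, identity_layer, lookup_layer]

res_a, out_a = patch_and_complete(early_fetch, "United Kingdom", patch_layer=1)
res_b, out_b = patch_and_complete(late_fetch, "United Kingdom", patch_layer=1)

print(out_a, out_b)                            # same completion in both cases
print("capital" in res_a, "capital" in res_b)  # but the patched residuals differ
```

Both toy models complete with the same capital, yet only in the first one was the capital already present in the patched residual. In the second, the residual carried only the entity, and the layers after the patch point did the fetching. Looking at the completion alone does not separate the two cases.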