Semantic code clone detection remains a challenging task in software engineering. While AI models report high accuracy, they often fail to generalize to other codebases, raising questions about their reliability and trustworthiness. We need ways to interpret the decision-making behavior of these code models and to evaluate whether it aligns with human intuition. To that end, our goal is to evaluate model behavior against human intuition using counterfactual data mutations. In this talk, I will discuss how we build a human-labeled dataset that marks code regions of core and non-core similarities and differences, and how we systematically perturb code clone pairs to examine the resulting shifts in model predictions. Our findings have practical implications, helping researchers and end users choose code clone detection models more effectively.
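To make the idea of counterfactual mutations concrete, here is a minimal Python sketch, not our actual pipeline: `predict_clone` is a hypothetical placeholder for whatever clone detection model is under evaluation, and the two hard-coded edits stand in for mutations of non-core (identifier renaming) and core (operator change) code regions. A model whose behavior aligns with human intuition should be largely insensitive to the first mutation and noticeably sensitive to the second.

```python
# Minimal sketch: probe a clone detector with counterfactual mutations
# and observe how its prediction shifts.
import difflib

def predict_clone(code_a: str, code_b: str) -> float:
    """Hypothetical stand-in for a learned clone detector: returns a
    similarity score in [0, 1] (here, simple lexical similarity)."""
    return difflib.SequenceMatcher(None, code_a, code_b).ratio()

original = "def total(xs):\n    return sum(xs) / len(xs)\n"
clone = "def mean(values):\n    return sum(values) / len(values)\n"

# Non-core mutation: rename identifiers only -- semantics preserved,
# so the prediction should barely move.
non_core = clone.replace("values", "v")

# Core mutation: change the operator -- semantics altered,
# so the prediction should drop noticeably.
core = clone.replace("/", "*")

baseline = predict_clone(original, clone)
for label, mutant in [("non-core (rename)", non_core), ("core (operator)", core)]:
    shifted = predict_clone(original, mutant)
    print(f"{label}: {baseline:.2f} -> {shifted:.2f} (shift {shifted - baseline:+.2f})")
```

In the study itself, the placeholder would be replaced by the model being evaluated, and mutations would be drawn from the human-labeled core and non-core regions rather than hard-coded edits.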