Apples to Oranges? The Art of Accurately Comparing AI Models