Deep Dive Experiment: Benchmarking Code Generation from GPT-4 to Llama 3