7 min read · Verified by Essa Mamdani · © 2025 Essa Mamdani

Claude Opus 4.5 vs. GPT-5.1 vs. Gemini 3: A Brutal Coding Task Showdown

The relentless march of AI continues, and we're past the era of canned demos and pre-formatted datasets. The real test now is how these models handle the messy, ambiguous realities of real-world coding. This article dissects a head-to-head comparison of Claude Opus 4.5, GPT-5.1, and Gemini 3 on a coding task designed to push their limits, highlighting their strengths, their weaknesses, and what the results suggest about the trajectory of AI in software development.

The core objective was simple: throw a deliberately vague, poorly phrased request at each model and observe its ability to parse the intent, identify key constraints, and generate working code. This simulates the common scenario where a developer receives incomplete requirements or needs to quickly prototype a solution from a high-level description.

The Challenge: A Fuzzy File Conversion

The chosen task involved converting a file containing semi-structured data into a cleaner, more usable format. The initial prompt, mirroring a typical developer's off-the-cuff request, lacked specific details and deliberately omitted crucial information.

Here's the gist of the prompt (simplified for clarity):

"Need a script that can read this messy file and output a nice CSV. Columns should be name, age, city."

That's it. No file format specification, no expected CSV dialect, no error handling instructions, and no indication of the data's inherent structure. This ambiguity forces the AI to make assumptions and infer intent, revealing its reasoning capabilities.
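
Before committing to a parser, a human developer would probably just look at the raw lines first. A minimal probe along those lines, using a hypothetical two-line sample in place of the unspecified file (the filename and contents are stand-ins, not the actual test data):

```python
# Hypothetical stand-in for the "messy file" from the prompt.
sample = "Alice\t30\tLondon\nBob\t25\tParis\n"
with open("people.txt", "w") as f:
    f.write(sample)

# repr() makes invisible separators visible: tabs vs. commas vs. runs of spaces.
with open("people.txt") as f:
    for line in list(f)[:2]:
        print(repr(line))
```

A probe like this is exactly the information the models had to guess at, which is why their delimiter assumptions diverged in Round 1.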

Round 1: Initial Code Generation & Analysis

All three models produced code, but the devil was in the details.

  • Claude Opus 4.5: Offered a Python script using the csv module and attempted to infer the input file format. It correctly guessed a tab-separated structure (TSV) based on the sample data provided alongside the original Reddit post. However, the initial script lacked robust error handling and assumed a perfect data structure, which is rarely the case in real-world messy data.

    python
    import csv

    def convert_to_csv(input_file, output_file):
        try:
            with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
                reader = csv.reader(infile, delimiter='\t')  # Assumes tab-separated
                writer = csv.writer(outfile)
                writer.writerow(['name', 'age', 'city'])  # Headers
                for row in reader:
                    name = row[0]
                    age = row[1]
                    city = row[2]
                    writer.writerow([name, age, city])
            print(f"Successfully converted {input_file} to {output_file}")
        except Exception as e:
            print(f"Error: {e}")

    convert_to_csv('input.txt', 'output.csv')
  • GPT-5.1: Also generated a Python script with csv, but it opted for a comma-separated input assumption. This immediately highlighted its sensitivity to missing information and its inclination towards the most common data format. The code was well-structured but similarly weak on error handling and data validation.

    python
    import csv

    def convert_to_csv(input_file, output_file):
        try:
            with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
                reader = csv.reader(infile)  # Assumes comma-separated
                writer = csv.writer(outfile)
                writer.writerow(['name', 'age', 'city'])
                for row in reader:
                    writer.writerow([row[0], row[1], row[2]])  # Basic row access
            print(f"Successfully converted {input_file} to {output_file}")
        except FileNotFoundError:
            print(f"Error: Input file '{input_file}' not found.")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")

    convert_to_csv('input.txt', 'output.csv')
  • Gemini 3: Chose a more sophisticated approach, reaching for Pandas DataFrames to handle the conversion. This indicated a deeper understanding of data manipulation and cleaning. However, it also made assumptions about the file structure and lacked explicit error handling beyond basic file existence checks. The initial code was overly complex for the task at hand.

    python
    import pandas as pd

    def convert_to_csv(input_file, output_file):
        try:
            df = pd.read_csv(input_file, sep='\t', header=None, names=['name', 'age', 'city'])  # Assumptions made here
            df.to_csv(output_file, index=False)
            print(f"Successfully converted {input_file} to {output_file}")
        except FileNotFoundError:
            print(f"Error: Input file '{input_file}' not found.")
        except Exception as e:
            print(f"An error occurred: {e}")

    convert_to_csv('input.txt', 'output.csv')

Key Observation: All models generated syntactically correct code, but their interpretations of the vague prompt differed significantly. Claude Opus 4.5 correctly guessed the TSV format, giving it an initial edge. However, the lack of robust error handling and data validation was a common weakness across the board. Gemini 3's choice of Pandas indicated a potentially more powerful approach, but its initial implementation was more complex and prone to errors with unanticipated data structures.

Round 2: Refining the Code with Targeted Feedback

The next phase involved providing targeted feedback to each model, focusing on improving error handling, data validation, and flexibility. The goal was to see how well they could adapt and refine their code based on specific instructions.

Example Feedback (given to all models, tailored slightly):

  • "Add error handling to catch invalid age values (e.g., non-numeric or negative values). Skip rows with invalid age."
  • "Make the script handle different delimiters (comma, tab, space) and infer the delimiter if not explicitly specified."
  • "Add logging to track errors and skipped rows."

The results were revealing:

  • Claude Opus 4.5: Responded well to the feedback, adding robust error handling for age validation and incorporating a delimiter detection mechanism using heuristics (e.g., counting the most frequent separator). It used the logging module effectively.

  • GPT-5.1: Implemented the requested features, but its delimiter detection was less sophisticated than Claude Opus 4.5's. It relied on explicitly specifying the delimiter rather than attempting to infer it dynamically in all cases. Its error handling was adequate but less comprehensive.

  • Gemini 3: Struggled somewhat with the delimiter detection, producing less reliable code than the other two models. However, it demonstrated a greater capacity for data cleaning and transformation, suggesting Pandas functions like fillna for handling missing values. The logging implementation was basic.
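
Pulling the three pieces of feedback together: the following is a minimal sketch of what a refined script might look like, combining frequency-based delimiter inference, age validation, and logging of skipped rows. None of the models' actual Round 2 code is public, so the function names, candidate delimiters, and file names here are illustrative.

```python
import csv
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("convert")

# Candidate separators to score; an illustrative set, not the models' actual list.
CANDIDATES = [",", "\t", ";", "|"]

def infer_delimiter(lines):
    """Heuristic: pick the candidate separator that occurs most often."""
    counts = Counter()
    for line in lines:
        for d in CANDIDATES:
            counts[d] += line.count(d)
    delim, n = counts.most_common(1)[0]
    return delim if n > 0 else ","  # fall back to comma

def convert_to_csv(input_file, output_file):
    with open(input_file, "r", newline="") as infile:
        lines = infile.readlines()
    delim = infer_delimiter(lines[:10])  # sample the first few lines
    skipped = 0
    with open(output_file, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["name", "age", "city"])
        for i, row in enumerate(csv.reader(lines, delimiter=delim), start=1):
            if len(row) < 3:
                log.warning("row %d: expected 3 fields, got %d; skipped", i, len(row))
                skipped += 1
                continue
            name, age, city = (field.strip() for field in row[:3])
            if not age.isdigit():  # rejects non-numeric and negative ages
                log.warning("row %d: invalid age %r; skipped", i, age)
                skipped += 1
                continue
            writer.writerow([name, age, city])
    log.info("wrote %s (%d rows skipped)", output_file, skipped)
```

Worth noting: the standard library's csv.Sniffer can perform similar delimiter inference out of the box; the counting heuristic above simply mirrors the approach the article attributes to Claude Opus 4.5.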

Key Observation: Claude Opus 4.5 showed the most consistent responsiveness to feedback, producing the most robust and adaptable code. GPT-5.1 was a solid performer, but its delimiter detection was less sophisticated. Gemini 3, while offering powerful data manipulation capabilities, had difficulty with the delimiter inference aspect.

Round 3: Stress Testing with Edge Cases

The final round involved exposing the refined code to various edge cases and unexpected data formats. This included:

  • Empty fields
  • Extra columns
  • Mixed delimiters within the same file
  • Non-UTF-8 encoded data

This stress test revealed the true resilience of each model's code:

  • Claude Opus 4.5: Handled most edge cases gracefully due to its robust error handling and delimiter detection logic. It skipped rows with invalid data but logged the errors, providing valuable debugging information. It struggled with non-UTF-8 encoding.

  • GPT-5.1: Encountered more errors than Claude Opus 4.5, particularly with mixed delimiters and empty fields. Its error handling, while improved, was not comprehensive enough to catch all edge cases. It also struggled with non-UTF-8 encoding.

  • Gemini 3: Proved surprisingly resilient to empty fields, leveraging Pandas' fillna function effectively. However, it failed miserably with mixed delimiters and struggled with non-UTF-8 encoding.

Key Observation: Claude Opus 4.5 emerged as the clear winner in this round, demonstrating the most robust and adaptable code. GPT-5.1 performed adequately, but its error handling and delimiter detection were less sophisticated. Gemini 3, while showcasing Pandas' data manipulation prowess, faltered in several critical areas, particularly delimiter handling.
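
Gemini 3's resilience to empty fields came from Pandas-style cleaning. A minimal sketch of that pattern, using a hypothetical two-row TSV in place of the actual test data (which isn't public); the sentinel value of 0 for missing ages is an illustrative choice:

```python
import pandas as pd
from io import StringIO

# Hypothetical TSV with an empty age field in the second row.
raw = "Alice\t30\tLondon\nBob\t\tParis\n"

df = pd.read_csv(StringIO(raw), sep="\t", header=None,
                 names=["name", "age", "city"])
df["age"] = df["age"].fillna(0).astype(int)  # fill missing ages with a sentinel
print(df.to_csv(index=False))
```

The trade-off the article observes holds here too: fillna makes empty fields trivial to handle, but nothing in this approach helps when the delimiter itself is inconsistent.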

Actionable Takeaways

  1. Robust Error Handling is Paramount: Vague prompts necessitate robust error handling. AI-generated code should anticipate unexpected data and gracefully handle errors, providing informative logging for debugging. Without it, the code becomes brittle and unreliable in real-world scenarios.

  2. Delimiter Inference is a Key Differentiator: Accurately inferring data delimiters is crucial for handling messy, unstructured data. Claude Opus 4.5's heuristic-based approach proved more effective than relying solely on explicit specification.

  3. Pandas Power Comes at a Price: While Pandas offers powerful data manipulation capabilities, its complexity can introduce overhead and make the code less intuitive. Gemini 3's reliance on Pandas sometimes resulted in overly complex solutions and reduced resilience. Choose tools wisely based on the specific task requirements.

  4. Embrace Feedback Loops: AI-assisted coding is an iterative process. Providing targeted feedback and stress-testing the code is crucial for refining its accuracy, robustness, and adaptability.

  5. Encoding Matters: Ensure the AI model is capable of handling various character encodings, including non-UTF-8. This is essential for processing data from diverse sources.
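
One stdlib-only way to act on this takeaway is an encoding-fallback reader: try UTF-8 first, then common legacy encodings, with Latin-1 as the backstop since it accepts any byte sequence. This is a sketch of the general technique, not code any of the models produced; dedicated detectors like chardet or charset-normalizer exist for harder cases.

```python
def read_text(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in turn; latin-1 never fails, so it is the backstop."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue

# Demo: a file written in a legacy single-byte encoding that is not valid UTF-8.
with open("legacy.txt", "wb") as f:
    f.write("José\t30\tMadrid\n".encode("latin-1"))

text, used = read_text("legacy.txt")
print(used, repr(text))
```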

In conclusion, while all three models demonstrated impressive coding capabilities, Claude Opus 4.5 consistently outperformed GPT-5.1 and Gemini 3 in this specific coding task. Its robust error handling, sophisticated delimiter inference, and consistent responsiveness to feedback made it the most reliable and adaptable solution. However, it's important to note that these results are specific to this particular task and may vary depending on the complexity, domain, and nature of the prompt. The future of AI in software development hinges on continuous improvement in these areas, pushing the boundaries of automation and unlocking new levels of productivity.

Source: https://www.reddit.com/r/GeminiAI/comments/1p8tx82/comparing_claude_opus_45_vs_gpt51_vs_gemini_3/