OpenAI Launches GPT-4.1 Family with Advanced Coding Capabilities

On Monday, OpenAI launched a new family of models, the GPT-4.1 series, which includes GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. These models are designed to excel in coding and instruction-following, featuring a 1-million-token context window, which allows them to process about 750,000 words in a single pass. However, the models are available only through OpenAI’s API, not ChatGPT.
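For developers, that means access goes through OpenAI's standard API rather than the ChatGPT interface. As a minimal sketch of a chat completion request using the official Python SDK (the model identifier "gpt-4.1" is an assumption here; check OpenAI's documentation for the exact ID):

```python
# Minimal sketch of calling GPT-4.1 through the OpenAI API.
# The model ID "gpt-4.1" is assumed, not confirmed by this article.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
)

print(response.choices[0].message.content)
```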

GPT-4.1 launches as OpenAI faces growing competition from companies like Google and Anthropic, which are intensifying their efforts to develop advanced programming models. Google’s recently introduced Gemini 2.5 Pro, which also has a 1-million-token context window, is performing well on popular coding benchmarks. Similarly, Anthropic’s Claude 3.7 Sonnet and Chinese AI startup DeepSeek’s upgraded V3 are making their mark.


The ultimate goal for many tech giants, including OpenAI, is to train AI models capable of handling complex software engineering tasks. OpenAI’s ambitious vision, as shared by CFO Sarah Friar at a tech summit in London last month, is to create an “agentic software engineer.” The company envisions future models capable of building entire applications from scratch, managing tasks like quality assurance, bug testing, and documentation writing.

According to OpenAI, the full GPT-4.1 model outperforms GPT-4o and GPT-4o mini on coding benchmarks like SWE-bench. The GPT-4.1 mini and nano versions are designed to be faster and more efficient, though they sacrifice some accuracy. OpenAI claims the GPT-4.1 nano is the fastest and cheapest model in the series, priced at just $0.10 per million input tokens and $0.40 per million output tokens.
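At those rates, estimating the cost of a request is simple arithmetic. The sketch below uses hypothetical token counts purely for illustration:

```python
# Rough cost estimate for GPT-4.1 nano at the quoted rates:
# $0.10 per million input tokens, $0.40 per million output tokens.
# The token counts below are hypothetical, for illustration only.
INPUT_RATE = 0.10 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.40 / 1_000_000  # dollars per output token

input_tokens = 200_000   # e.g. a large codebase excerpt
output_tokens = 4_000    # e.g. a generated patch plus explanation

cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
print(f"Estimated cost: ${cost:.4f}")  # -> Estimated cost: $0.0216
```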

OpenAI’s internal testing shows that GPT-4.1, with its larger maximum output length (32,768 tokens vs. 16,384 for GPT-4o), scored between 52% and 54.6% on SWE-bench Verified, a human-validated subset of SWE-bench. (OpenAI noted that some SWE-bench Verified problems couldn’t run on its infrastructure, hence the score range.) These results are slightly lower than those of Google’s Gemini 2.5 Pro (63.8%) and Anthropic’s Claude 3.7 Sonnet (62.3%) on the same benchmark.

In another evaluation, OpenAI tested GPT-4.1 using Video-MME, a tool designed to assess how well a model “understands” video content. GPT-4.1 achieved a leading 72% accuracy in the “long, no subtitles” video category, according to OpenAI.


Although GPT-4.1 performs well on benchmarks and benefits from a more recent knowledge cutoff (up to June 2024), it’s important to note that even top-tier models still struggle with tasks that should be relatively straightforward for experts. Numerous studies have shown that code-generating models often fail to resolve, or even introduce, security vulnerabilities and bugs.

OpenAI also acknowledges that GPT-4.1’s reliability decreases as the number of input tokens increases. In one of OpenAI’s own tests, the OpenAI-MRCR, the model’s accuracy dropped from about 84% with 8,000 tokens to just 50% with 1 million tokens. Additionally, GPT-4.1 tends to be more “literal” than GPT-4o, sometimes requiring more specific and explicit prompts to perform accurately.
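In practice, that means prompts may need to spell out constraints the model would otherwise gloss over. A hypothetical illustration of tightening a vague instruction into an explicit one (the wording is an example, not guidance from OpenAI):

```python
# Hypothetical example of making a prompt more explicit for a "literal" model.
vague_prompt = "Clean up this function."

explicit_prompt = (
    "Refactor the following Python function without changing its behavior: "
    "keep the function name and signature, remove unused variables, "
    "add type hints, and return only the revised code with no commentary."
)
```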
