Apple Says Its ReALM Is Better Than GPT-4 at This Task

Apple’s research team has published a paper asserting that their ReALM language model outperforms OpenAI’s GPT-4 in “reference resolution.”

On Friday, Apple researchers shared a preprint paper about their ReALM large language model, stating it significantly surpasses OpenAI’s GPT-4 in specific benchmarks. ReALM is designed to better grasp and manage various contexts, potentially enabling users to reference items on their screen or in the background and ask the language model questions about them.

ReALM

Reference resolution is the task of working out what expressions such as “they” or “that” refer to. Humans can usually infer the meaning from context, but chatbots like ChatGPT can find this challenging. Apple highlights the importance of this capability, noting that letting users refer to things on a screen with terms like “that” or “it,” and having the chatbot resolve the reference precisely, could be key to a truly hands-free screen experience.
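
To make the problem concrete, here is a toy, rule-based resolver in Python. It is not Apple’s method; the entity names, kinds, and the crude keyword-plus-recency heuristic are illustrative assumptions, meant only to show what binding “that” to the right item involves.

```python
# Toy reference resolver (illustrative only, not ReALM): given an utterance
# containing "that"/"it", pick which candidate entity it most likely refers to.
from dataclasses import dataclass

@dataclass
class Entity:
    name: str   # e.g. "555-0134" or "Joe's Pizza" (hypothetical examples)
    kind: str   # e.g. "phone_number", "business", "podcast"

def resolve_reference(utterance, candidates):
    """Crude heuristic: 'call' suggests a phone number, 'play' suggests media;
    otherwise fall back to the most recently listed candidate."""
    text = utterance.lower()
    preferred = None
    if "call" in text:
        preferred = "phone_number"
    elif "play" in text:
        preferred = "podcast"
    for entity in reversed(candidates):        # most recent candidates first
        if preferred is None or entity.kind == preferred:
            return entity
    return candidates[-1] if candidates else None

# "Call that" while a business listing with a phone number is on screen:
screen = [Entity("Joe's Pizza", "business"), Entity("555-0134", "phone_number")]
print(resolve_reference("Call that", screen))  # -> the phone-number entity
```

Heuristics like this break as soon as the phrasing or context shifts, which is the gap a learned model such as ReALM is meant to close.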

Apple’s latest publication is one of three AI-focused papers the company has released in recent months, hinting at potential new features for iOS and macOS. The researchers aim to improve ReALM’s ability to recognize three types of entities: onscreen, conversational, and background entities. Onscreen entities are items visible on the display, while conversational entities come from the ongoing dialogue. For instance, if a user asks about their workout for the day, the chatbot should draw on earlier conversation, such as a previously discussed three-day workout plan, to work out which exercises apply.

Apple’s research includes understanding background entities, which are neither onscreen nor conversational but still pertinent, like a background podcast or a recent notification. ReALM aims to recognize these when users mention them.
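
Since the paper describes resolution over these three pools of candidates, one way to picture the setup is as a single flattened, numbered textual context that a language model chooses from. The sketch below is a guess at such an encoding; the tag format, field names, and prompt wording are assumptions for illustration, not the paper’s actual representation.

```python
# Hedged sketch: flatten onscreen, conversational, and background entities into
# one numbered textual context a language model could select a referent from.
# The tagging scheme and prompt wording are illustrative, not ReALM's encoding.

def build_context(onscreen, conversational, background):
    """Each argument is a list of short entity descriptions (strings)."""
    lines, index = [], 1
    for source, entities in (("onscreen", onscreen),
                             ("conversational", conversational),
                             ("background", background)):
        for description in entities:
            lines.append(f"[{index}] ({source}) {description}")
            index += 1
    return "\n".join(lines)

context = build_context(
    onscreen=["Joe's Pizza — 555-0134", "Open 11am–10pm"],
    conversational=["the user's three-day workout plan"],
    background=["podcast playing: 'Design Matters', episode 214"],
)
prompt = (
    "Candidate entities:\n" + context +
    "\n\nUser request: 'Call that place.'\n"
    "Reply with the number of the entity being referenced."
)
print(prompt)
```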

In their paper, the researchers state, “We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5 percent for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.”

The researchers point out that their GPT-3.5 baseline received only a text prompt, while for GPT-4 they also supplied a screenshot to improve performance on the task.
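
For readers wondering what a “prompt plus screenshot” formulation looks like in practice, the sketch below uses the public OpenAI Python client to send a text prompt together with a base64-encoded screenshot; the model name, prompt wording, and helper function are assumptions, and this is not the researchers’ exact setup. The GPT-3.5-style baseline would simply omit the image part.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def resolve_with_screenshot(prompt_text, screenshot_path):
    """Send a reference-resolution prompt together with a screenshot to a
    vision-capable GPT-4 model (illustrative setup, not Apple's)."""
    with open(screenshot_path, "rb") as f:
        screenshot_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # any vision-capable GPT-4 model; name is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```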

In the paper, they mention, “Note that our ChatGPT prompt and prompt+image formulation are, to the best of our knowledge, novel. While we believe it might be possible to further improve results, for example, by sampling semantically similar utterances up until we hit the prompt length, this more complex approach deserves further, dedicated exploration, and we leave this to future work.”

Although ReALM outperforms GPT-4 on this specific benchmark, that isn’t enough to declare it the superior model overall: ReALM was designed specifically to excel at this task. It also remains unclear when or how Apple intends to incorporate ReALM into its products.
