
OpenAI o1 when it FAILS



My first look at the new OpenAI o1-preview, released yesterday and aimed especially at science and advanced reasoning. You are live with my very first encounter and tests of OpenAI o1.

Unfortunately I hit the maximum quota of OpenAI o1 quite fast, but I recorded my tests. This video presents my questions and tests in chronological order, exploring OpenAI o1 for the first time.

My test was simple: generate thematic topic clusters for the latest 70 AI research papers from yesterday with the OpenAI o1-preview model. Normally that would involve an SBERT sentence-transformer model with a domain-specific tokenizer, dimensionality reduction with UMAP from a high-dimensional vector space, and further optimizations, since all the texts cover a brand-new research topic unseen by any AI system.
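The classical pipeline mentioned above (embed, reduce dimensionality, cluster) can be sketched in a few lines. This is a minimal illustration, not the author's actual setup: the paper titles are invented examples, and scikit-learn stand-ins (TF-IDF and TruncatedSVD) replace the SBERT sentence transformer and UMAP named in the post, so the sketch runs without model downloads.

```python
# Minimal sketch of an embed -> reduce -> cluster pipeline.
# Assumptions: TF-IDF stands in for SBERT sentence embeddings, and
# TruncatedSVD stands in for UMAP; the titles below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

titles = [
    "Scaling laws for sparse mixture-of-experts language models",
    "Efficient fine-tuning of large language models with LoRA",
    "Graph neural networks for molecular property prediction",
    "Message passing on molecular graphs for drug discovery",
    "Retrieval-augmented generation for open-domain QA",
    "Instruction tuning improves zero-shot generalization",
]

# 1. Embed the paper titles (stand-in for an SBERT sentence transformer).
vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)

# 2. Reduce the high-dimensional vectors (stand-in for UMAP).
low_dim = TruncatedSVD(n_components=2, random_state=42).fit_transform(vectors)

# 3. Cluster into thematic topic groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(low_dim)

for title, label in zip(titles, labels):
    print(f"cluster {label}: {title}")
```

In the real pipeline, the first two steps would use a domain-adapted SBERT model and UMAP, which matter precisely because brand-new research vocabulary is poorly served by generic embeddings.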

Q: Why does my title have the word "FAIL" in it?
A: Given my uploaded text segment of 70 scientific titles and technical annexes, o1 failed on the first attempt and reported back that it had detected only 8 papers. This is a clear fail. My hypothesis is that another agent network then activated, along the lines of "give him one agent at first and activate the whole fleet only when necessary." If o1 has an internal self-validation check, this should not have happened. If a human has to continuously monitor and evaluate each AI response, it gets boring very quickly, and I could do the thinking myself. And any economic case for an industrial AI application at profit-oriented companies vanishes.

Although not perfect, it is an intelligent tool, especially for science, to examine cross-discipline publications and uncover thematic insights.

By the way, the recorded failure of OpenAI o1 to immediately recognize that the text contains 70 research papers could theoretically indicate that some agents were activated on the second attempt, after which o1 succeeded. But we have to wait for the technical paper by OpenAI.

Nice @OpenAI

All rights with me. I look forward to continuing my tests, since this was only a first look and in no way a thorough testing regime. Therefore, try it out yourself, and why not leave your impressions in the comments, so that next time we all know a bit more about the real performance of OpenAI o1. The tactic of granting us only very limited access and then making us wait for weeks (!) before further testing does not help the community assess the performance of o1. So let us be patient. Smile.

#airesearch
#chatgpt
#airesearchlab #openai



16 Comments

  1. Now I am reading the first marketing material by OpenAI about o1, especially a tweet by Noam Brown (@OpenAI): "we aim for future versions to think for hours, days, even weeks. Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries?" So this is an interesting new phenomenon: you pre-train your model just as a starting impulse, and then the model (o1), trained with RL, "thinks" via a "private (??) chain of thought" before responding? Then the model (o1) becomes better at reasoning tasks the longer it "thinks"? Can a limited brain (me and o1) think up new worlds it has never experienced? And the more "it" thinks, the more causal reasoning performance increases? A strange idea, that you don't learn from experience and failure but from abstract thinking in the wild and self-reflection alone, without a link to (human) reality… Where is the logical threshold for this method?

  2. GREAT VIDEO

    Clustering multidisciplinary subjects is the reason I learned a little Python at 56 years old, and why I try to follow your videos (despite not having an information-technology background).
    This video is very interesting, thanks!
    I also like your marketing approach of being negative in the title. I do not agree that it is misleading, as the comments above suggest.

    I still find it interesting to work out how we can "enforce" clustering, a little as o1 did for new articles… Somehow, I would like to have more control over the process.

    Thanks again!

  3. Can we be honest about o1? Everyone has been tinkering with CoT prompting. It's not a new thing and the results are about what we all expected. The hard part about building with LLMs is gathering, managing, and encapsulating context. o1 is OpenAI joining the other big tech companies in launching half-baked products to pump the share price/valuation. Welcome to the club 🤝

  4. My main complaint with OpenAI's o1 model is that it only allows about 30 messages per week! As someone who uses AI for work-related tasks, I easily reached the quota in under 4 hours and now need to wait a week just to get access again. I get that the model might require more compute, but with those restrictions I don't see o1 as a reason to subscribe to ChatGPT Plus or Teams, especially if the Advanced Voice feature isn't rolled out to you. In around half of all queries I have to reprompt the model multiple times just to get something working, resulting in me losing 3-5 messages for implementing one moderately complicated function. For "simple" prompts, this new model is definitely a step up, but for everything that requires deeper reasoning and knowledge, the model is still nowhere near good enough to significantly enhance or replace human entry-level work, especially when it comes to coding. At least in my opinion.

  5. I think I did too many derivatives, and maybe there is a policy regarding the excessive use of math and code… Who knows?

    "
    Hello,

    We are reaching out to you as a user of OpenAI’s ChatGPT because some of the requests associated with the email * have been flagged by our systems to be in violation of our policy against attempting to circumvent safeguards or safety mitigations in our services.

    Please halt this activity and ensure you are using ChatGPT in accordance with our Terms of Use and our Usage Policies. Additional violations of this policy may result in loss of access to OpenAI o1.

    If you believe this is in error and would like to appeal, please contact us through our help center.

    The OpenAI team
    "

  6. This entire video shows the chat using GPT-4o, not o1-preview… perhaps you used o1-preview, but after switching back to the chat, the model selection changed…

