Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?
This is your opportunity to ask questions. No question too simple or too silly.
Culture war topics are accepted, and proposals for a better intro post are appreciated.
Data science is genuinely so fun with ChatGPT-4, Copilot, and a decent modern GPU. I found an interesting paper with no public GitHub/code, pasted in ~2,000 words about their pipeline (mostly copied from the paper), and GPT-4 reproduced the (relatively complex) pipeline perfectly, in that I was getting almost identical results to the paper. At one point I was having issues, and I guess image support isn't yet accessible (for me) in the OpenAI API, so I described a diagram in a paragraph and it understood it on the first attempt.
Copilot makes importing and processing data a joke. There are probably more advanced ways to do it, but I literally just #comment what I want it to do, press Enter, and press Tab, and it mostly figures it out. Tell it to make some interesting visualizations and it writes them; ask a query about the data and it can answer it. I've also been using Copilot to generate clean CSVs from shoddy or otherwise messed-up data that would be a nightmare to clean manually.
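For anyone who hasn't tried it, the workflow looks roughly like this (file and column names are invented for illustration, not from a real project):

```python
import pandas as pd
import matplotlib.pyplot as plt

# load sales.csv, parse the date column, and drop rows with missing revenue
df = pd.read_csv("sales.csv", parse_dates=["date"])
df = df.dropna(subset=["revenue"])

# plot monthly total revenue as a bar chart
df.set_index("date")["revenue"].resample("M").sum().plot(kind="bar")
plt.show()
```

You write the comment lines; Copilot fills in the code underneath.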
One of the best uses is finding old papers from pre-2015, pulling the original dataset if it's public, briefly explaining to ChatGPT what the structure of the data and experiment is, and then just asking it to rewrite the approach as a modern pipeline. (I've tried this with Copilot and it works sometimes, but actual ChatGPT-4 is more consistent; some people have tried to automate this with the GPT API, but when I tried their code the results were inferior for some papers.) Admittedly I guess this means a late-2021 pipeline given the GPT training cutoff, but it's enough to yield huge improvements in predictive performance.
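To give a made-up flavor of what "rewrite as a modern pipeline" usually ends up meaning, it's often just swapping the paper's original estimator for a current off-the-shelf one (dataset and column names below are placeholders, not from any specific paper):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# hypothetical: the paper's public dataset, with a binary outcome column
df = pd.read_csv("old_paper_dataset.csv")
X, y = df.drop(columns=["outcome"]), df["outcome"]

# drop-in modern baseline where the original paper hand-tuned something simpler
model = GradientBoostingClassifier(random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```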
I think this has underscored how much of the value moving forward will be in raw data. Foundation models with automated tuning will be used for everything, and LLMs will be able to tune, clean, prepare, and modify code to make them work. "AI" is going to be cheap beyond compute costs, which will come down hugely anyway if everyone's using the same small number of pretrained models (part of the reason why I think Nvidia's bull run is going to end in tears) and software engineers are increasingly automated.
Instead, the money might well be in people who can navigate bureaucracies to acquire data, or who can collect/generate it themselves. I guess this explains Elon's obsession with trying to prevent AI companies scraping Twitter, although anything online is going to be a free-for-all, except the very largest datasets that people might pay for (sometimes). Niche datasets that you can't just find on the internet or pull from Kaggle are going to be valuable, especially because they might only be saleable once or a few times. The 'AI layer', critical though it is, will be almost impossible to make margin on if you're not big tech; all the margin will be in the data itself.
(Also, how funny that I should have learned how to program like 3 months before it ceased to be useful)
Do you have a good Copilot guide?
I use it with VSCode, but I feel like I don't use it to the fullest of its abilities. It pretty much just serves as a worse GPT-4 for me.
Are you following the basics by using explicit/obvious function names and descriptive docstrings to start off with?
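E.g. something like this (a made-up example of what I mean):

```python
import pandas as pd

def median_revenue_by_region(df: pd.DataFrame) -> pd.Series:
    """Return the median 'revenue' per 'region', sorted descending."""
    # with a name and docstring this explicit, Copilot will usually
    # produce a body like this after a single Tab
    return df.groupby("region")["revenue"].median().sort_values(ascending=False)
```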
I'm a data scientist/engineer and I use GPT a lot; probably 95% of the code I wrote in the last 6 months was drafted by GPT.
It's turned a lot of things that would have been 6-month-long NLP projects into 2-3 day projects (the DevOps timeline is also much shorter because we just use the OpenAI API), and the results are far, far superior. NLP projects are a breeze now, and some things that were just straight impossible a few months back are easily done in a week or two (RLHF helps). It's actually amazing. I can't be spilling trade secrets, but I can assure you that with a bit of creativity you can get a lot more done than just making a chatbot or text summarizer (we use it in production for 5 different tasks that used to be done by people; our CEO is kind enough (can afford) to keep them around for now, albeit with different responsibilities).
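I obviously can't paste our actual prompts, but structurally these production tasks are all variations on one pattern: hide the model behind an ordinary function with a tight system prompt and temperature 0. A stripped-down, hypothetical sketch (using the current openai-python ChatCompletion interface; the task is invented, not one of ours):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

def classify_ticket(ticket_text: str) -> str:
    # hypothetical task for illustration only
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the support ticket into exactly one of: "
                        "billing, bug, feature_request, other. Reply with the label only."},
            {"role": "user", "content": ticket_text},
        ],
    )
    return resp["choices"][0]["message"]["content"].strip()
```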
I do feel the need to defend the honor of my profession here, given you have spoken about how we might just get the boot because GPT can automate us away. Riddle me this: who do you think people are turning to to make things with GPT? I might even put "Prompt Engineer" in my LinkedIn bio soon.
Also, for NOW, it's not as simple as writing a #comment, pressing Enter, and pressing Tab. When you have to deal with real-world data, and not stuff you download from Kaggle, there are database connections to make (and no tech lead in his right mind is going to let you connect to the database with an external API that you POST to somewhere in the code!!), queries to batch because the replica DB times out, JSON parsers to write because the field you want is nested 15 levels deep, etc. It's an impediment if you let it take the wheel.
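A sketch of the kind of unglamorous plumbing I mean (connection string, table, and field names are all placeholders):

```python
import json
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@replica-host/warehouse")  # placeholder DSN

# batch the query so the replica doesn't time out on one giant scan
chunks = []
for offset in range(0, 1_000_000, 50_000):
    chunks.append(pd.read_sql(
        f"SELECT id, payload FROM events ORDER BY id LIMIT 50000 OFFSET {offset}",
        engine,
    ))
df = pd.concat(chunks, ignore_index=True)

def dig(blob, path):
    """Walk a key path down a deeply nested JSON payload."""
    node = json.loads(blob)
    for key in path:
        node = node.get(key, {}) if isinstance(node, dict) else {}
    return node or None

# the field we actually want, several levels deep (imagine 15)
df["campaign_id"] = df["payload"].map(lambda b: dig(b, ["meta", "source", "campaign", "id"]))
```

None of that is hard, but GPT can't see your schema, your replica's timeout, or your tech lead's opinions.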
I really wanted Code Interpreter to be useful for things past the simplest exploratory analysis, but I don't think good EDA code that asks the right questions (the answers to which make money) is open-sourced. It's alright, but its analysis skills are those of an undergrad who took 2 stats courses and uses techniques from the 1980s. I don't expect GPT to become competent at statistics anytime soon, given most of its training data is written by amateurs (good data science code is usually not open-sourced, for obvious reasons).
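To be concrete about what "simplest exploratory analysis" means, it reliably produces the standard first pass, roughly this (placeholder file name):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # placeholder
print(df.describe())                                  # summary stats
print(df.isna().mean().sort_values(ascending=False))  # missingness by column
print(df.corr(numeric_only=True))                     # pairwise correlations
df.hist(figsize=(10, 8))                              # distribution of each column
plt.show()
```

Fine as far as it goes, but it stops there.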
And seriously, that training-data point cannot be overstated. In an ideal world, data scientists don't exist for writing
import pandas
and wrangling the shit out of that DataFrame. They exist for the same reason plumbers do: the fee isn't for hitting the pipe, it's for knowing where to strike. GPT-5 might know where to strike too, but I'm not going to hold my breath, for lack of training data.
I mostly agree with you, but I could also see software engineers going the way of the draftsmen. There used to be a huge sector focused entirely on taking an engineer's ideas and turning them into drawings on paper so they could be handed out to clients. Then Computer-Aided Drafting (CAD) software came about, productivity skyrocketed, and the demand for drawings rose only modestly. Now there's a small sector focused on drafting, and some fraction of engineers do their own drafting in the design stage.
Let's use your example of a 40-60x increase in productivity, and imagine that competition keeps salaries at their current level. What if demand for software products only increases by 1000% due to the reduced prices? The field could shrink to one fifth its size and still meet demand.
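Spelled out, headcount scales with demand divided by per-worker output:

```python
productivity_gain = 50  # midpoint of the 40-60x range
demand_growth = 10      # "increases by 1000%", read loosely as ~10x
print(demand_growth / productivity_gain)  # 0.2 -> one fifth of today's headcount
```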
This is fascinating, and makes me want to play with data via these tools. I just wish I had an idea of how to get data to answer certain questions, or a better understanding of the infrastructure. Got any data science primers lying around?
Pick up a statistics textbook.
GPT itself is great here; it can very easily (while explaining as much or as little as you want at every step along the way) guide you through the entire thing, from having a question about some data, through programming the project, to making pretty charts and interpreting them at the end.
Otherwise, TowardsDataScience is good for lots of basics, and if you're working with text, skimming the NLTK Book (a free online resource) is good for understanding the basics of language-oriented ML/data science, although a lot of it is currently being made obsolete by modern language models.