Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?
This is your opportunity to ask questions. No question too simple or too silly.
Culture war topics are accepted, and proposals for a better intro post are appreciated.
Apologies for the naive question, but I'm largely ignorant of the nuts and bolts of AI/ML.
Many data formats in biology are just giant arrays, with each row representing a biological cell and each column representing a gene (RNA-Seq) or a parameter (flow cytometry). Sometimes rows are genetic variants and columns are various characteristics of said variant (minor allele frequency, predicted impact on protein function, etc.).
Is there a way to feed this kind of data to LLMs? It seems trivial for ChatGPT to parse 'This is an experiment looking at activation of CD8+ T cells; generate me a series of scatterplots and gates showcasing the data,' but less trivial for it to handle the giant 500,000x15 (flow) or 10,000x20,000 (scRNA-Seq) arrays. Or is there a way for LLMs to interact with existing software?
What’s the advantage over normal programming?
This really feels like a pair of 'for' loops instead of a flexible task. You could even go up a level and write a tool that lets you pick different axes.
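For illustration, here's a rough sketch of the two-loop version; the data layout (a dict of pandas DataFrames, one per sample), the column names, and the output file naming are all made up:

```python
# Rough sketch of the "pair of for loops" version. Assumes each sample
# is already loaded as a pandas DataFrame of events x parameters.
import matplotlib.pyplot as plt

def scatter_grid(samples, axis_pairs):
    for name, df in samples.items():       # outer loop: one pass per sample
        for x, y in axis_pairs:            # inner loop: one plot per axis pair
            fig, ax = plt.subplots()
            ax.scatter(df[x], df[y], s=1)  # tiny markers for ~500k events
            ax.set(xlabel=x, ylabel=y, title=f"{name}: {x} vs {y}")
            fig.savefig(f"{name}_{x}_vs_{y}.png", dpi=150)
            plt.close(fig)
```

Going up a level just means making `axis_pairs` something the user picks instead of hard-coding it.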
Why language models specifically? From a cursory Google I found a couple of papers which may make more sense to you than they do to me:
https://www.sciencedirect.com/science/article/pii/S1672022922001668
https://www.frontiersin.org/articles/10.3389/fimmu.2021.787574/full
Do you want LLMs so you can "talk to" your lab results? Otherwise it's easier to analyse masses of data without the LLM middleman.
Yeah, exactly. There's a lot of grunt work involved in flow cytometry analysis, which I was thinking of more than the scRNA-Seq. Machine learning for most basic flow cytometry is slightly overkill, because what you're doing with each gate is conceptually pretty simple. I tried to elaborate/clarify in this comment.
You should send the grunt work to CCP, where EVE denizens can do it for fractions of a cent.
You've been repeatedly warned to stop doing low effort drive-bys like this that contribute nothing.
Banned for five days this time.
Interesting. I've spent a lot of time staring at t-SNE plots (or UMAPs, which have more recently taken over), and they map pretty well onto our underlying understanding of the biology. It got a bit hairy when we asked it to split the data into too many clusters, and it was difficult to know whether we were looking at some novel, minor cell type or a hallucination.
I think I asked that question poorly and also lack the vocabulary to describe what I'm envisioning. Current software for analyzing this kind of data (flow) exists and the typical workflow is just making a series of scatterplots with 'gates,' or subsets of cells that express a given marker. Here's a basic example.
Verbally, it's all very simple: gate on singlets, then lymphocytes via forward/side scatter, exclude dead cells, gate on CD3+, and then split into CD4 and CD8 T cells. It's the kind of instruction that should be very easy for ChatGPT to parse, even with just a single sentence outlining the experiment. But how do you feed it the data? Is there a way for ChatGPT to interact with existing analysis software to draw gates/generate scatterplots...? I assume you wouldn't want to paste the raw array of cells into your prompt, although I don't know.
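For concreteness, that verbal recipe is maybe a dozen lines of code. A minimal pandas sketch, where every column name and threshold is invented (real gates get drawn per experiment):

```python
# Hedged sketch of the gating sequence above as plain pandas filters.
# Column names (FSC-A, FSC-H, SSC-A, Viability, CD3, CD4, CD8) and all
# thresholds are made up for illustration.
import pandas as pd

def gate_t_cells(df: pd.DataFrame) -> dict:
    singlets = df[df["FSC-H"] / df["FSC-A"] > 0.85]      # doublets have a low height-to-area ratio
    lymphs = singlets[singlets["FSC-A"].between(3e4, 1e5)
                      & (singlets["SSC-A"] < 5e4)]       # lymphocyte scatter gate
    live = lymphs[lymphs["Viability"] < 1e3]             # exclude dye-bright dead cells
    t_cells = live[live["CD3"] > 1e3]                    # CD3+ T cells
    return {
        "CD4 T": t_cells[(t_cells["CD4"] > 1e3) & (t_cells["CD8"] < 1e3)],
        "CD8 T": t_cells[(t_cells["CD8"] > 1e3) & (t_cells["CD4"] < 1e3)],
    }
```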
Maybe I'll back up and zoom out a bit. Most people use FlowJo to analyze flow cytometry data. It's a multibillion-dollar industry, they haven't updated the software in something like a decade (and that update made it worse than the version I was using before), and you routinely draw the same gates over and over again. Imagine you have 50 samples in your experiment and each sample has 10 gates: you're skimming over 500 scatterplots, then inputting however many readouts you have into other histogram plots to represent the data you got. It's repetitive and the software is clunky. LLMs definitely seem 'smart' enough to understand everything that's going on, but I don't have the first idea how you communicate that kind of data to them...
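And the repetitive 50-sample part is just a loop over files. A hypothetical batch sketch reusing `gate_t_cells()` from above; it uses fcsparser, a real Python FCS reader, though double-check the exact call for your version:

```python
# Batch version: apply the same gates to every sample and dump the plots,
# instead of clicking through 500 scatterplots by hand.
from pathlib import Path

import fcsparser
import matplotlib.pyplot as plt

for fcs_path in sorted(Path("experiment/").glob("*.fcs")):
    meta, events = fcsparser.parse(str(fcs_path))    # events: one row per cell
    for label, cells in gate_t_cells(events).items():
        fig, ax = plt.subplots()
        ax.scatter(cells["CD4"], cells["CD8"], s=1)
        ax.set(xlabel="CD4", ylabel="CD8", title=f"{fcs_path.stem}: {label}")
        fig.savefig(f"{fcs_path.stem}_{label}.png", dpi=150)
        plt.close(fig)
```

Whether an LLM writes this once from a one-sentence description or drives existing software to do the same thing, the repetitive part goes away.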
Sorry, I think my description of what I was thinking of was exceptionally poor. I tried to elaborate in this comment.
If you get something like this going, let me know. I'm exploring a local-LLM use case: cybersecurity packet analysis. Loading bulk data separately from the prompt engineering, etc.; all of this is complicated by a small 2k context length. Newer open models have landmark attention for 10k-30k+ context lengths, but they're less sophisticated 7B/13B-parameter models compared to the 30B ones I've been using.
I have no useful suggestion, but that's a neat idea! Great example of the kind of thing that AI could straightforwardly do and save a huge amount of man-hours of tedious, boring labor. The AI probably still won't know why anyone would care about a particular gate, but it could make it quite easy to visualize things.
I'm no expert but have some familiarity. LLMs have a limited context window (GPT-4's is 8k tokens), so they can't hold all of that data at once. Probably the easiest way to get one to chew through that much is to ask it for code to do the things you want (directing it to write some pygraph or R code or something). It could plausibly do it inline if you asked it to summarize chunks of data, then fed the previous summary in with the next chunk. The code would act as a much more auditable, and probably more accurate, tool though.
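For the inline route, the shape of it might be something like this, with `ask_llm()` as a stand-in for whatever client you'd actually use:

```python
# Sketch of the "summarize chunks, carry the summary forward" idea.
# ask_llm() is a placeholder; plug in the OpenAI API, a local model, etc.
import pandas as pd

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def rolling_summary(df: pd.DataFrame, chunk_rows: int = 200) -> str:
    summary = "No data seen yet."
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        prompt = (
            f"Running summary so far:\n{summary}\n\n"
            f"New rows (CSV):\n{chunk.to_csv(index=False)}\n"
            "Update the summary to incorporate the new rows."
        )
        summary = ask_llm(prompt)  # each call only needs one chunk plus the summary
    return summary
```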