Small-Scale Question Sunday for August 6, 2023

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.

The only long-form corpus of text under my real name would be my research papers. I guess I'm glad they were all heavily rewritten by my professor.

On the other hand, I am skeptical purely because of how the statistics work. Like seriously, I can't find a single paper on this that isn't amateur hour.

It would not be trivial to produce a low-error model that yields high-confidence matches unless you are so prolifically online with both your pseudonymous account and your real account that, purely by force of sample size, the error rate approaches 0 and the match confidence approaches 1.

Sometimes you really do run into the limits of information theory.
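For a rough sense of that limit (a standard back-of-the-envelope bound, nothing specific to this case): singling one author out of N equally likely candidates requires about log2(N) bits of identifying information, and noisy stylistic signals leak only a few bits each.

```latex
% Identifying information needed to pick one author out of N candidates:
I = \log_2 N \quad\Rightarrow\quad \log_2\!\left(10^{6}\right) \approx 19.9\ \text{bits}
```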

Yeah, there's a lot of text on the internet. From a pretty cursory Bayesian analysis, even with 99.9% accuracy you're looking at a thousand false positives if you're combing through a million posts. Without something else to narrow it down, it seems reasonable that it won't be possible from writing patterns alone.
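To make that base-rate arithmetic explicit, here's a minimal sketch; the 99.9% true-positive rate, 0.1% false-positive rate, and million-post pool are illustrative assumptions, not measured numbers.

```python
# Back-of-the-envelope Bayes: how believable is a stylometric "match"
# when the classifier is "99.9% accurate" but the candidate pool is huge?
# All numbers are illustrative assumptions.

def posterior_match_probability(tpr, fpr, n_candidates):
    """P(candidate is the real author | classifier says 'match'),
    assuming exactly one true author among n_candidates equally likely people."""
    prior = 1.0 / n_candidates
    p_match = tpr * prior + fpr * (1.0 - prior)
    return tpr * prior / p_match

tpr = 0.999      # true positive rate ("99.9% accuracy")
fpr = 0.001      # false positive rate
n = 1_000_000    # posts/accounts being combed through

print(f"expected false positives: {fpr * n:,.0f}")                                   # ~1,000
print(f"P(real author | match):   {posterior_match_probability(tpr, fpr, n):.4f}")   # ~0.001
```

Even a hit from a "99.9% accurate" classifier is still overwhelmingly likely to be a false positive when the prior is one in a million.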

Good point, and that's on top of the difficulties of modeling such a thing in the first place. Unless said model has obscenely high accuracy, there is always going to be plausible deniability behind false positives.

You will probably Light Yagami yourself with information you gave away about your personal life long before they can fingerprint your text.

Sure, but the point is that these methods overlap: you can use a powerful LLM to parse high-likelihood text samples for shared details (or even things like shared interests, obscure facts, and specific jargon), narrowing down your list of a thousand matches. The passwords/emails angle is also really important: most people reuse them at least sometimes, and there are tons of leaked lists online. With those you can chain pseudonymous identities together (automatically, eventually; right now this is still extremely labor-intensive, so it only happens in high-profile doxxings where suspicions already exist).
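To make the "these methods overlap" point concrete, here's a toy sketch of stacking roughly independent signals. The filters and pass rates are invented for illustration; real signals are messier and correlated.

```python
# Toy illustration of stacking roughly independent signals to shrink a
# candidate pool. Filters and false-positive rates are invented for the
# example, not real-world figures.

from functools import reduce

candidate_pool = 1_000  # e.g. survivors of a stylometric pass

# (signal, fraction of non-authors that would also pass it)
filters = [
    ("mentions the same niche hobby",       0.05),
    ("matching activity-hours pattern",     0.30),
    ("reuses a handle/email from a leak",   0.01),
]

expected_survivors = reduce(lambda pool, f: pool * f[1], filters, float(candidate_pool))
print(f"expected innocent survivors: {expected_survivors:.3f}")  # ~0.15
```

Three weak-ish signals already take an expected thousand stylometric survivors down to a fraction of a person, which is the whole overlap argument.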

And I think writing styles are more distinctive than you think: specific repeated spelling mistakes, specific repeated phrases, uncommon analogies or metaphors, weird punctuation quirks. And the dataset for a regular user here (many hundreds of thousands of words, in quite a few cases) is likely large enough for a model tuned on it to get really good at identifying that user's unique writing patterns.
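As a deliberately bare-bones example of what a "writing pattern" feature could look like, here's a character-trigram profile plus cosine similarity; real stylometry uses far richer features (function words, misspellings, punctuation) and calibrated models, so treat this as a sketch only.

```python
# Bare-bones stylometry sketch: character trigram profiles + cosine similarity.
# Purely illustrative; not a real attribution system.

from collections import Counter
import math

def trigram_profile(text):
    """Count overlapping character trigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

known = trigram_profile("large sample of posts written under the real name goes here")
unknown = trigram_profile("large sample of posts from the pseudonymous account goes here")
print(f"similarity: {cosine(known, unknown):.3f}")
```

The point isn't that trigrams alone identify anyone; it's that a pile of small quirks like these adds up over hundreds of thousands of words.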

Okay, but where is the literature? Just show me that it's theoretically possible. I would do the math that supports my side of the argument, but you know... burden of proof.

The reasons discussed in the two comments above apply even in your new scenario. I don't think you understood the core of the arguments.

Also, what you're describing can already be done in the present, without a "powerful LLM". And no, it can't be done automatically any time soon, because HTTP requests are not going to have a "fast takeoff" any time soon.