
Calling all Lurkers: Share your Dreams of Effortposting

It’s been pointed out recently that the topics discussed in the Culture War thread have gotten a bit repetitive. While I do think the Motte has a good spread of intellectual discussion, I’m always pushing for a wider range (dare I say diversity?) of viewpoints and topics in the CW thread.

I was a lurker for years, and I know that the barrier between having a thought and writing a top level comment in the CW thread can loom large indeed. Luckily I’m fresh out of inspiration, and would love to hear thoughts from folks about effortposts they want to write but haven’t gotten around to.

This of course applies to regulars who post frequently as well - share any and all topics you wish were discussed in the CW thread!


Does it make sense to take gwern's URL archiving to its radical conclusion and start automatically saving web video, random forums, and every version of every executable I ever download to disk? Would you get away with it for long before Google slammed you (I don't think youtube-dl is detectable?)? The tools for automatic categorization and identification exist (among others), but has anyone actually put them into a moderately user-friendly format to actually find stuff once you have archived it?

I could write this, though I don't really do effortposts.

Does it make sense? Yes: content disappears very rapidly. A lot of youtube videos I wanted to revisit have disappeared, along with a lot of random sites, a lot of tweets lost to suspensions or history-cleans, free PDFs that aren't free a year later, hosted images on discordapp.com or imgur, ...

If you mean "archiving every video you watch and website you visit", you'd get away with it easily. Youtube is botted, and videos have lots of bytes, so it has some bot protection - but iirc isn't that strong, and the yt-dlp maintainers have consistently gotten past it, so using that on every video you visit would work. So just [visit youtube.com/* url] -> [run yt-dlp to save video]. For every other website you visit, it's mostly just html, so (ignoring a ton of relatively boring things) you can just save the html, and the images and videos within, and view it later. For discord specifically, discord dump style tools work fine. I'm pretty sure google could kill yt-dlp if they wanted to (imagine they make a backlog of browser quirks, and every day they release a new youtube patch that modifies the JS challenge to depend on that quirk, requiring yt-dlp's JS emulator to be updated daily or run a full browser emulator), but they don't for some reason, even though they have stronger anti-bot protection in other areas.

If you're imagining something larger-scale - archiving everything on a forum, or tens of thousands of youtube videos - that's also doable! Archive Team and archive.org have done that kind of thing for a long time. In the case of youtube videos specifically, it's so easy (and videos are so large, and most youtube content is so useless) that archive.org actively asks people not to upload random scraped videos.

While big tech puts a lot of effort into defeating bots, it costs money for dev time and is a maintenance burden, so they only do it in areas where it's worth preventing bots, e.g. account creation and posting of content. Most top-100 websites can be trivially scraped at within-10x-of-human-activity intensities: they already have tens of millions of human users browsing, so read-only bots at that scale don't meaningfully add load and aren't worth blocking.
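Concretely, "within-10x-of-human-activity" read-only scraping can be as dumb as fetching pages with a polite delay. A sketch assuming the requests library (the URL list, paths, and delay are made up):

```python
# Sketch of low-intensity, read-only archiving: fetch pages with a delay so the
# request rate stays in the ballpark of a human clicking around.
# Assumes `pip install requests`; URLs, paths, and the delay are illustrative.
import time
import pathlib
import requests

urls = ["https://example.com/", "https://example.com/about"]
out = pathlib.Path("archive_html")
out.mkdir(exist_ok=True)

for i, url in enumerate(urls):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Save the raw html; images/videos would need a second pass over the links.
    (out / f"{i:06d}.html").write_text(resp.text, encoding="utf-8")
    time.sleep(2)  # a couple of seconds between requests ~ human browsing pace
```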

but has anyone actually put them into a moderately user-friendly format to actually find stuff once you have archived it

I imagine fast full-text search or embedding-based search would work fine here. I'm pretty sure there are janky open-source tools for 'save every text you look at and search it' using both approaches, as well as startups working on a good UI for it.
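For the full-text half, a rough sketch using SQLite's built-in FTS5 index (the schema, filenames, and example rows are made up; FTS5 ships with most builds of Python's sqlite3, but that's worth checking on your system):

```python
# Sketch of "save every text you look at and search it" with SQLite FTS5.
import sqlite3

db = sqlite3.connect("archive.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")
db.execute("INSERT INTO pages VALUES (?, ?)",
           ("https://example.com/", "an example page about url archiving"))
db.commit()

# Ranked full-text query over everything archived so far.
for (url,) in db.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank",
        ("archiving",)):
    print(url)
```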

what's wrong with web app deployment

This has improved a ton recently, with tons of commercial products and open source projects. Also, eevee was doing "See, I actually have a 64-bit kernel, but a 32-bit userspace", which ... it'd take a ton of effort to seamlessly support every quirky configuration people can come up with, so most devs don't, which is correctly prioritized imo. Similarly with the database: they didn't use the supported configuration of 'give it root', and instead did some permission workaround.
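For what it's worth, that mixed-bit-size setup is at least cheap to detect. A sketch of how a program could notice it, assuming a Linux-ish system where platform.machine() reflects the kernel's reported architecture:

```python
# Sketch: detect the "64-bit kernel, 32-bit userspace" mismatch described above.
# On Linux, platform.machine() generally reflects the kernel's architecture,
# while sys.maxsize tells you whether the interpreter you're running is 64-bit.
import platform
import sys

kernel_64 = platform.machine() in ("x86_64", "amd64", "aarch64")
userspace_64 = sys.maxsize > 2**32

if kernel_64 and not userspace_64:
    print("64-bit kernel but 32-bit userspace - expect packaging surprises")
```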

A little on the logistics side -- how and why is Etsy pulling 6% in sales fees, and eBay 13%, compared to PayPal's (already high) 3%?

I know less here, but ... 3% extra for Etsy seems reasonable? Maybe not reasonable in the sense of 'how the economy should be', but reasonable in the sense that they have to address regulatory complexity, develop their software, deal with payment issues, prevent fraud... patio11's writing might be relevant here.
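Just to put the percentages from the question in absolute terms, a back-of-envelope comparison on a hypothetical $50 sale (the item price is made up, and real fee schedules also have fixed per-transaction components and tiers):

```python
# Back-of-envelope: the cuts quoted above applied to a hypothetical $50 sale.
# The item price is made up; real fee schedules have fixed components and tiers.
price = 50.00
for name, rate in [("PayPal", 0.03), ("Etsy", 0.06), ("eBay", 0.13)]:
    print(f"{name}: {rate:.0%} -> ${price * rate:.2f} on a ${price:.2f} sale")
```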

How much do and should we trust a lot of User Design stuff as reflecting what is measured, rather than what is studied?

Design is tightly coupled to revenue, which means companies and the people in them will be properly incentivized to care about it. Compare a psych lab doing experiments on college students, where ... any result is fine if you can publish it, with the pricing page of your SaaS, where your main source of revenue is people clicking and you really want them to click. If, in the former, you spew out a bunch of 2% uplift nudges that, when all implemented, add up to 0%, you can still publish - nobody's checking. If, in the latter, you spew out a bunch of 2% uplift nudges that, when all implemented, add up to 0% ... you're not getting that bonus.

so much of residential internet or screen size is (and during the study period especially was) high-variance enough that it seems like these should have been swamped by noise

If it's actually per-user noise, sample sizes of 50M users x 100 interactions per day (averaging that many normals shrinks your standard error by ~70,000x!) are more than enough to wash it out for 'latency of every page load'. Even for 'converting to paid user', that's still a few million interactions total, which is more than enough. And if there are ten groups of users with entirely separate behavior, that's still only ~3x higher 'noise', which isn't that much.
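To spell out where the ~70,000x comes from - averaging N independent measurements shrinks the standard error by sqrt(N):

```python
# Quick sanity check of the arithmetic above: the standard error of a mean over
# N independent measurements is the per-measurement sigma divided by sqrt(N).
import math

n = 50_000_000 * 100           # 50M users x 100 interactions per day
print(f"{math.sqrt(n):,.0f}")  # ~70,711, i.e. the ~70,000x quoted above
```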

In the particular case of latency - I definitely do notice latency, and I use slower-loading sites less. Consumers being price-conscious about consumer goods, especially commonly purchased ones, is pretty well established, although idk the specifics of what you're referring to.

Suck thread was great.

Is there a (non-violent) solution to the problem of scam spamming, especially of the elderly, even if only a partial mitigation?

I thought of "social media companies take it as seriously as they do racism", but they don't deal with that effectively either. Maybe as seriously as they do CP or ISIS (but even for CP they're not great).

Are there any One Tricks for documentation? Not just in a code context; I hate javadocs, but they do seem like a genuine tool, and it's weird that they're such a singular example.

Not sure what you mean exactly. I also hate javadocs: /** Adds two numbers. @param param1 a number @param param2 another number @return the two numbers added */ public int add(int param1, int param2)

While big tech puts a lot of effort into defeating bots, it costs money for dev time and is a maintenance burden, so they only do it in areas where it's worth preventing bots, e.g. account creation and posting of content. Most top-100 websites can be trivially scraped at within-10x-of-human-activity intensities: they already have tens of millions of human users browsing, so read-only bots at that scale don't meaningfully add load and aren't worth blocking.

That's fascinating to hear.

I imagine fast full-text search or embedding-based search would work fine here. I'm pretty sure there are janky open-source tools for 'save every text you look at and search it' using both approaches, as well as startups working on a good UI for it.

Full-text search has a lot of applicable tooling, if you aren't willing to just learn grep (which, tbf...). Embeddings... there are a lot of image archivers that can (try to) identify and tag people (eg nextcloud), and for general objects it's probably possible to swap in yolo models, but I haven't found much that's a great way to actually find stuff. And other spaces, like automated transcription, are always a little tricky.
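On the text-embedding side, a minimal sketch of what retrieval over an archive could look like, assuming the sentence-transformers library (the model name is a placeholder; any general-purpose text-embedding model would do, and the documents are made up):

```python
# Sketch of embedding-based retrieval over archived text snippets.
# Assumes `pip install sentence-transformers`; model name and docs are made up.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["transcript of a talk about end mills",
        "forum thread on url archiving",
        "notes on wheelbarrow prices"]
doc_vecs = model.encode(docs, normalize_embeddings=True)

query_vec = model.encode(["where did I read about archiving?"],
                         normalize_embeddings=True)
scores = doc_vecs @ query_vec.T  # cosine similarity, since vectors are normalized
print(docs[int(np.argmax(scores))])
```

The image/yolo side is the same shape: embed (or tag) each image once at archive time, then search over the stored vectors and tags.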

Also, eevee was doing "See, I actually have a 64-bit kernel, but a 32-bit userspace", which ... it'd take a ton of effort to seamlessly support every quirky configuration people can come up with, so most devs don't, which is correctly prioritized imo. Similarly with the database: they didn't use the supported configuration of 'give it root', and instead did some permission workaround.

That's somewhat fair, although more so for the mixed bit-size than for giving root to random software (and even for the mixed-bit-size problem, it's a little discouraging that "we don't support 32-bit" or "here's this vital extension" isn't in the documentation). And I can certainly understand and empathize with the problems of end users wanting support across a ridiculous breadth of deployment environments: I've submitted code to fix one-off problems that likely only applied to narrow circumstances like mine, and I can understand why they were accepted or rejected.

But I don't think eevee's issues were, and my issues generally are not, about one piece of software having a problem in one environment. A sizable portion of eevee's problems were less about the specialized failure modes and more that even the canonical install paths aren't really complete (or are just Docker, or, more recently, flatpak has started showing up for no goddamn reason). At the time of writing, Discourse did not say 'install Docker or else'. It had a pretty long installation guide! But it did not (even at the time of eevee's writing; it was deleted the day after that post) actually cover things like 'what are the actual dependencies', rather than just the minimum number of apt-get calls to get it to build on the author's machine.

That's not just the fault of the Discourse designers. The problem is that development in general (NuGet and Maven have encouraged the exact same bad habits!), but especially web development, no longer has, and often does not expect anyone to have, the ability to seriously inspect dependencies, even as dependency trees have expanded. If you are very careful, you might be able to get your application to list all of its immediate dependencies (no one did for Discourse, hence the sidekiq bit, so it's a little bit Discourse's fault) in terms of full application-level dependencies. But those will have their own dependencies (or extensions, or modules, or packages, yada yada), for which you might, at best, get a list of what's currently installed. And in a growing number of situations, you'd have to dance down another level from that.
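To make "dependencies of dependencies" concrete, here's a sketch that walks a transitive dependency tree from installed-package metadata. It uses Python's importlib.metadata purely as a stand-in (Discourse itself is Ruby/gems, so this is only an analogy), and the starting package is arbitrary:

```python
# Sketch: walk the transitive dependency tree of an installed Python package.
# This is an analogy for the gem/npm/NuGet situation described above.
import re
from importlib.metadata import requires, PackageNotFoundError

def walk(pkg, depth=0, seen=None):
    seen = set() if seen is None else seen
    if pkg.lower() in seen:
        return
    seen.add(pkg.lower())
    print("  " * depth + pkg)
    try:
        reqs = requires(pkg) or []
    except PackageNotFoundError:
        return  # declared but not installed - exactly the visibility gap above
    for req in reqs:
        # Strip version specifiers, extras, and markers to get the bare name.
        name = re.split(r"[ ;<>=!~\[]", req, maxsplit=1)[0]
        walk(name, depth + 1, seen)

walk("pip")  # any installed distribution name works; ones with deps are more fun
```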

Docker bypasses this by pulling from specific installation images in order and just not caring if something else gets pulled along by accident -- which, hey, I'd be fine with on small scales. But then it installs a copy for each container. Which does solve dependency hell, since there's now one dependency install per application... at the cost of making it increasingly easy to have dozens of (oft-outdated) versions of common dependencies.

This would be a little annoying if it were just a problem during install, but maintenance and updating tend to be where it goes really bad. I've had multiple GitLab instances -- even with the 'recommended' omnibus! -- where upgrading just exploded because one version somewhere was out of whack. NextCloud just had a (nontrivial!) bug related to PHP version support. Even with grav, which is supposed to be about as simple as it gets, I've still seen it go tango uniform because of a dependency versioning problem the developer was unaware of.

3% extra for Etsy seems reasonable? Maybe not reasonable in the sense of 'how the economy should be', but reasonable in the sense that they have to address regulatory complexity, develop their software, deal with payment issues, prevent fraud... patio11's writing might be relevant here.

Patio11's writing is fantastic, but Patio11 also works for Stripe, which offers (listed) sales cuts around 3% total, and hasn't taken over the world. Part of that is because Stripe doesn't want to (or, rather, Stripe's banks don't want it to), but another part is that there's not a horde of startups breaking down Stripe's door to take Etsy's lunch and 'only' make a billion USD... nor to provide a valet service and charge 20%.

I could definitely imagine a vendor that gave the average seller 3%-ish worth (or even 10%+/30%+!) of sales in benefits. It's actually not that hard, and that's a pretty reasonable cut in some circumstances. Amazon itself has bizarrely tight economic tolerances -- which doesn't mean it's an efficient marketplace, but winks and nods in that direction -- and much of its business-side income comes just from shaking down sellers for advertising. It's weird that this has turned into the standard for online sales even as a lot of these groups are doing less, and doing it worse, while no competitors are coming up at the extreme low end, and no more-reputable vendors charging a little more (or providing fewer sales-assist services) have come forward. Amazon-style drop-shipping comes across as a narrow local maximum for a fairly broad sphere, despite being incredibly janky, and I don't think the conventional explanation makes sense.

The punchline to this twitter thread is that the Menards replacement probably ranged from 10 bucks more to 60 bucks less, depending on which popular wheelbarrow eigenrobot was getting and what shipping he used. You can buy end mills on AliExpress, Etsy, or Amazon, and they'll be the exact same end mill from the exact same manufacturer, at radically different prices. Or, if you end up having to do currency conversion, Paypal ends up breaking normal expectations there.

So you don't have hugely price-conscious buyers, nor hugely convenience-driven ones, nor is it obviously trickery (as bad as Amazon or Paypal dark arts get, they haven't actually earned that much cash from them). Is it just being a first mover? Internet-wide search gone fucky? Scale and size? Reputation (if so, how bad would Amazon or ali* have to get)? Do people just hate having multiple logins?

Design is tightly coupled to revenue, which means companies and the people in them will be properly incentivized to care about it.

Fair. I guess, post-Pivot To Video, I'm kinda nervous about highly-publicized 'studies' by a corporation with One Weird Trick and a lot of reasons why replication failures wouldn't 'count'.

Not sure what you mean exactly. I also hate javadocs...

I hate javadocs, too, but a) people write them, b) people update them, and c) external users can read them, even if most don't. But I'm more gesturing at how they're a documentation technology, in a way that technologies-used-for-documentation (eg, wikis, technical writers) are not, even if they aren't particularly effective. It's weird that this isn't something more common or more widely exploited, beyond bad puns about self-documenting code.