I, along with others, said years ago that answers which don't address the question asked should be removed. In this case, malicious actors are pushing packages that are unrelated to the problem presented in the question. This will become more prevalent than package squatting.
The root of the problem is allowing pip to execute arbitrary code when installing a package from PyPI (default package index), combined with the complete lack of vetting of PyPI packages. Even something as simple as mistyping a package name can cause malicious code to be executed on your computer.
Installing packages (i.e. source code) for programming languages should not execute arbitrary code.
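To be concrete about what "arbitrary code on install" means here: for a source distribution, setup.py is just Python that pip runs during installation. A minimal sketch (the package name is made up and the "payload" is only a print statement):

    # setup.py of a hypothetical package "definitely-not-requests".
    # pip executes this module when installing the sdist, so anything here
    # (or in a custom install command) runs with the installing user's permissions.
    from setuptools import setup
    from setuptools.command.install import install

    class PostInstall(install):
        def run(self):
            # An attacker would put a real payload here; this sketch just prints.
            print("this could be any code at all, running at install time")
            install.run(self)

    setup(
        name="definitely-not-requests",
        version="0.0.1",
        cmdclass={"install": PostInstall},
    )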
Does installation matter at all? Once you've installed a package, you're very likely to immediately include it in the project, build, and run. Any malicious code can easily run during the package's initialization in your app.
> Does installation matter at all? Once you've installed a package, you're very likely to immediately include it in the project, build, and run. Any malicious code can easily run during the package's initialization in your app.
True, but in the case I've mentioned, if you've mistyped the name of the package, you can safely uninstall it without any issues.
Furthermore, code is often run in (sort of) sandboxed environments like Docker during development, in which case arbitrary code at runtime is less dangerous than arbitrary code at install time.
Is that not the case here? I'm not sure if pip has a mechanism similar to deb's pre/post-install hooks.
> Furthermore, code is often run in (sort of) sandboxed environments like Docker during development, in which case arbitrary code at runtime is less dangerous than arbitrary code at install time.
Wouldn't the package install then also happen in a Docker container anyway, negating the problem? Or how would you install a package to your host environment and then use it from within a container?
Sounds like a good reason to make dev containers the norm. At least both the dev and deployment environments will be somewhat isolated. And the worst you can do from the dev container is commit bad code, which can be caught in code review.
This is true, but you can inspect the package (and its dependencies) once installed and before importing it. Right now that is an entirely manual process (with the default tools, as far as I know), which contributes to the current state where only very few people inspect their packages.
It also gives people a second chance to notice, e.g., a typo in the package name.
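For what it's worth, you can also look before installing anything at all. Something like the sketch below (the package name is just a placeholder) pulls down only the wheel and lists what's inside, without running any of its code:

    # Rough sketch: fetch a wheel without installing it, then list its contents
    # so you can eyeball odd modules, bundled binaries, etc. before committing.
    import subprocess
    import sys
    import tempfile
    import zipfile
    from pathlib import Path

    PACKAGE = "requests"  # placeholder for whatever you were about to install

    with tempfile.TemporaryDirectory() as tmp:
        # --only-binary forces a wheel, which carries no install-time hooks to run
        subprocess.run(
            [sys.executable, "-m", "pip", "download", "--no-deps",
             "--only-binary", ":all:", "--dest", tmp, PACKAGE],
            check=True,
        )
        for wheel in Path(tmp).glob("*.whl"):
            print(f"Contents of {wheel.name}:")
            for name in zipfile.ZipFile(wheel).namelist():
                print("  ", name)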
Actually it is even worse because JS is more popular and the average JS dependency tree is so much more massive. Any tiny forgotten transitive dependency of a dependency (or dev dependency) 15 layers deep can pull this off. Total leftpadization is not a thing on PyPI due to a different culture.
Haha. This is the most secure part. When you start compiling, Rust will happily download random "crates" and include them in your program, because "dependencies".
"curl | bash" is the exact same threat vector as "download this compiled installer". I wonder why that doesn't result in the same hyperbolic responses.
The difference is that "curl | bash" doesn't save the output and can be distinguished server-side from commands that do <https://news.ycombinator.com/item?id=34145799>, so it's an especially easy way for a malicious server to undetectably have you run something different from what you've looked over.
At least pipe-to-bash install approaches can be snagged with wget to inspect the script without running it; lots of package managers seem harder than that to inspect/predict. Not that there couldn't be sneaky things you'd miss, but at least obvious ones might be detectable...
Good idea, unless the attacker serves you just more code that fetches and executes a one-off script from memory (and the download link for that is dynamic and only valid once).
We could go down a rabbit hole of various exploits and defenses, but at the end of the day it all comes down to trust. If you don't trust the source of the code, it doesn't matter whether it comes in via curl or a .deb signed with GPG: you're still trying to run untrusted code. If your threat model is such that you don't want to do that, don't do that. No one's forcing anybody to run curl | sudo bash at gunpoint.
This, and dev machines should be isolated environments via something like GitHub Codespaces, local Docker containers, or Google's IDX.
You also shouldn’t be able to push code directly to anywhere other than a centralized repository. All the build stuff should happen in a dedicated and independent process.
From a security point of view you need to accept that the chance of running into a malicious package as a direct or indirect dependency is not only non-zero but fairly realistic, and you should try to limit the blast radius as much as possible for when that does happen.
I don’t know why you’re trying to make this some kind of gotcha. What I said was absolutely 100% standard security advice. You’re just making yourself look silly here.
Other languages with dependency management frameworks (Java with Maven Central) go about this a little differently: you can download as many evil dependencies as you like, but unless you actively execute them, nothing will happen. There are some caveats, especially with nasty frameworks that scan the entire classpath, but that's a different issue.
You need an analogue of Linux distributions (Python distribution? Making choosing a repo a separate and deliberate step?).
That way you could pick a repo based on your risk appetite rather than needing to trust PyPI: Debian-style Python for the cautious, AUR-style Python for the bleeding edge and reckless.
This is a valid and significant issue, though for our internal packages running arbitrary code on execution has been a lifesaver. Think: complex distributed systems where all nodes need to be synchronized with the version on the driver.
You could probably create a more convincing version of this with a bit of collusion between multiple accounts on StackOverflow.
If you have one person asking a deliberate question, someone else answering with a backdoored package that actually does solve the problem, then a few accounts upvoting and adding comments, it would look a lot more convincing than just a random answer with zero upvotes that doesn't actually solve the problem.
But you would also need someone to search for and find that answer in order to actually deliver your payload. Answering random questions has the advantage that the original poster is incentivized to execute your code.
This would very likely set off one of the voting-ring detection mechanisms. It's not so easy to manipulate voting on Stack Overflow – quite a lot of work goes on behind the scenes to stop this. I'm not saying it's impossible, but it's not easy either and requires significant effort. Combined with the fact that your malicious package is also quite likely to be detected fairly quickly(-ish), I'm going to say that it's probably not that effective a mechanism.
How about scrape SO to find new, popular questions; have an LLM write a convincing solution; tack on “you could also try pip install notmalware, which is much faster”; automatically post that poisoned answer.
Extra credit: have the infected user’s computer also start posting these answers to SO
Users have been pushing their work in answers for years. Not a surprise that some are malicious.
In my case, I never bother looking at answers that prescribe a dependency. I find that it’s very important that I completely understand any code that I use from SO (or any other source), and I usually change the code to better fit my specific use case (and it’s usually a pretty small code sample).
I suppose this is the one time where Stackoverflow being a cesspool is actually useful. Any attempt to post malware would just be marked as a duplicate and closed instantly.
Only questions are closed as duplicates. Stackoverflow is strict and pedantic to the point of being very unfriendly to newbies, but it doesn't close things at random.
If you want to criticize Stackoverflow, there's plenty of solid ground to do that. No need to say things that aren't true.
> Stackoverflow is strict and pedantic to the point of being very unfriendly to newbies
At what point do you stop being a newbie? I have been on SO for 14 years and have found "lately", or the last five or so years, to be extraordinarily unfriendly. I am in the middle of solving a complex problem, so I posit a question, but it's stripped to the bare bones to make it easier to answer -- and the answer is almost always not what I seek but instead "why are you doing this". Aaaaargh. I would need to post War & Peace to make you understand why, so I just go and delete the question when this happens, which is too often. Example: https://stackoverflow.com/q/77202800/308851 -- lots of comments asking why I am doing this were deleted, but the downvote remains. I kept this one up for whatever reason, although downvotes usually are enough to make me delete a question.
My advice: if you can't help then stay away from the question. Alas, this likely won't reach the hopped-on-SO-power idiots but it's worth a try.
My strategy is to wait and see. Initially downvoted questions can become quite popular. The people doing the downvoting on new questions are a minority; the majority of traffic and votes comes over months from search engines.
I was refraining from commenting that these would be easy to spot, since they were actually being helpful and (apparently) answering the question...
Yes, I would be a lot more concerned about that than about people finding stuff on Stackoverflow. I can see Copilot or some other GPT "suggest" stuff like this soon.
Plug: I’ve been building Packj [1] to detect malicious PyPI/NPM/Ruby/PHP/etc. dependencies using behavioral analysis. It uses static+dynamic code analysis to scan for indicators of compromise (e.g., spawning of shell, use of SSH keys, network communication, use of decode+eval, etc). It also checks for several metadata attributes to detect bad actors (e.g., typo squatting).
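The real pipeline is more involved, but to give a flavor of the static side, a decode+eval indicator check boils down to something like this toy example (illustrative only, not Packj's actual code): walk each file's AST and flag eval()/exec() calls fed by a base64-style decode.

    # Toy indicator check (illustrative only): flag source files where
    # eval()/exec() is called on something derived from a base64-style decode.
    import ast
    import sys
    from pathlib import Path

    SUSPECT_CALLS = {"eval", "exec"}
    DECODE_NAMES = {"b64decode", "b85decode", "decodebytes"}

    def uses_decode(node: ast.AST) -> bool:
        # True if any call under this node looks like base64.b64decode(...) etc.
        for sub in ast.walk(node):
            if isinstance(sub, ast.Call):
                func = sub.func
                name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
                if name in DECODE_NAMES:
                    return True
        return False

    def scan(path: Path) -> None:
        try:
            tree = ast.parse(path.read_text(errors="ignore"))
        except SyntaxError:
            return
        for node in ast.walk(tree):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in SUSPECT_CALLS
                    and any(uses_decode(arg) for arg in node.args)):
                print(f"{path}:{node.lineno}: {node.func.id}() on decoded data")

    for p in Path(sys.argv[1]).rglob("*.py"):
        scan(p)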
Certainly not a new thing, but a good reminder to be very careful when copying and pasting code from Stackoverflow. Surprised that they're not using a load of bots to upvote the malware-laden answers to make them seem more legitimate.
Any unusual voting pattern would certainly make it more likely that Stackoverflow noticed and investigated. But I wonder how many people are likely to try downloading a library they've never heard of from a post with zero upvotes?
Of course, once an LLM scrapes that answer and regurgitates it, then you'd never know.
A truly patient malware author wouldn't even necessarily need to make it obvious. You could upvote something over the course of six months and still have a very negative impact.
Hell, you could create a library, not have malware in it, answer the questions, wait six months, add the malware. Oh wait, I'm not a criminal. Forget I said that.
Patience is unlikely for individuals, but it becomes a real concern once you get to the level of state-based actors.
I regularly use Python libraries in my work that emulate Excel, and they're often out of date and abandoned/unloved, for the most part because the people who write them eventually move on and they don't have a big enough user base.
That means they're ripe for somebody inserting stuff. Worse, Excel is usually used by big companies, so it's a target as well.
So much wasted effort. Just create an account on Upwork, play by Upwork's rules, and get complete unfettered access to any number of servers and websites run by small businesses and marketing consultancies. Sure, you might have to do a bit of Wordpress dev to keep the illusion going, but... there you go.
It would be pretty easy to manipulate a real question and be less obvious by actually solving the problem, with the malicious user also embedding a bad package within their answer.
Since the problem is actually solved, the real user would mark it as the accepted answer, which would help the malicious package cascade to others who come to solve the same issue.
This is much bigger than this particular issue. This is the same as the Xz backdoor: supply chain attack. It’s sad, but the fact is you cannot actually trust anyone else’s code unless you audit it yourself. Even something seemingly-well-vetted like Xz can be a vehicle for malware.
Do virus scanners know how to interpret base64 encoding when they scan a file? I assumed they would be able to catch something like this. Or wouldn't the OS catch this .exe file when it is being downloaded?
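What I'm picturing is something like the sketch below (just my guess at the concept, not how any real AV engine works): pull out base64-looking runs, decode them, and check for known magic bytes like the "MZ" header of a Windows executable.

    # Conceptual sketch only: find long base64-looking runs in a file, decode them,
    # and check for the 'MZ' magic bytes of a Windows executable. Real scanners
    # are far more sophisticated than this.
    import base64
    import re
    import sys

    B64_RUN = re.compile(rb"[A-Za-z0-9+/=]{40,}")

    def hides_an_exe(path: str) -> bool:
        data = open(path, "rb").read()
        for blob in B64_RUN.findall(data):
            try:
                decoded = base64.b64decode(blob, validate=True)
            except Exception:
                continue
            if decoded.startswith(b"MZ"):
                return True
        return False

    if __name__ == "__main__":
        print(hides_an_exe(sys.argv[1]))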
This reads like yet another major attack vector through LLMs. A threat actor only needs to provide a few mentions of the package on reputable sites and it is almost guaranteed that your backdoored package will be mentioned by the LLM once it is retrained with new data, or even in cases like RAG.
I am guessing I am not the first to think of this, though. Wouldn't be surprised if this kind of attack vector is already being set in motion for all kinds of other purposes: product reviews, etc.
It is an important consideration for live language models. The recent example of a bad Reddit answer popping up in Gemini -- putting glue on pizza -- demonstrates how easy it is.
While there are definitely ethical issues with the data they are using, trainers need to get a handle on this kind of thing, because for LLMs to be truly useful they have to consume large parts of the web.
Perhaps a ... human curated index ... would be more useful then? :)
I mean, if you're manually curating what you feed to the LLM, you end up building one of the web directories of old and might as well skip the LLM part... or use it just to gain funding...