The AI Dev Digest: LiteLLM Supply Chain Hack, llama.cpp Joins Hugging Face
This week's AI news that matters: a critical PyPI supply chain attack hits LiteLLM, llama.cpp officially joins Hugging Face, and massive MoE models start running on your MacBook.
Tob
Backend Developer
The AI infrastructure underneath your projects is moving fast. This week brought a wake-up call on package security, a landmark deal for open-source local AI, and a glimpse at just how far consumer hardware has come with big models.
TL;DR: A malicious version of LiteLLM briefly appeared on PyPI and could have stolen your SSH keys, AWS credentials, and crypto wallets. Meanwhile, llama.cpp and its creator Georgi Gerganov are officially joining Hugging Face. And massive MoE models like Qwen3.5-397B are now running on MacBooks and iPhones.
The LiteLLM Supply Chain Attack: Check Your Installed Versions Now
If your project uses LiteLLM and you updated recently, stop and check which version you have installed.
Malicious versions 1.82.7 and 1.82.8 were published to PyPI earlier this week. These were not accidents. They were credential harvesters.
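A quick programmatic check is a minimal sketch using only Python's standard `importlib.metadata`; run it inside each environment you are worried about:

```python
# Check whether a known-bad LiteLLM version is installed in this environment.
from importlib import metadata

BAD_VERSIONS = {"1.82.7", "1.82.8"}  # the malicious releases pulled from PyPI

try:
    installed = metadata.version("litellm")
except metadata.PackageNotFoundError:
    installed = None

if installed in BAD_VERSIONS:
    print(f"COMPROMISED: litellm {installed} is a known-bad release")
elif installed:
    print(f"litellm {installed} is not one of the known-bad versions")
else:
    print("litellm is not installed in this environment")
```

Remember to repeat this per virtual environment and per container image, since each has its own site-packages.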
The attack vector is worth understanding in detail because it shows how sophisticated these supply chain attacks have become.
It started with Trivy, a popular open-source security scanner. Trivy's GitHub Actions workflow was compromised, which gave attackers access to PyPI publishing credentials. Those same credentials happened to be reused for the LiteLLM PyPI account. So attackers published two malicious versions under a trusted package name.
Version 1.82.7 hid the payload in the proxy server code, so it only triggered if you explicitly imported and ran the proxy. Version 1.82.8 was worse. It embedded the payload in litellm_init.pth, a special Python startup file. Installing the package was enough to execute the code. You did not even need to run a single line of LiteLLM code.
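To see why install-time execution works, here is a benign demonstration of the .pth mechanism. Python's `site` module executes any line in a .pth file that starts with `import` when it processes a site directory, which happens automatically at interpreter startup for site-packages; the sketch below calls it by hand to make the effect visible:

```python
# Benign demo: a .pth file in a site directory can execute arbitrary code.
import os
import site
import tempfile

site_dir = tempfile.mkdtemp()
with open(os.path.join(site_dir, "demo.pth"), "w") as f:
    # Lines beginning with "import" in a .pth file are exec()'d by site.py.
    f.write('import os; os.environ["PTH_DEMO_RAN"] = "yes"\n')

# The interpreter does this automatically for site-packages at startup;
# calling it explicitly here shows the code runs with zero user action.
site.addsitedir(site_dir)
print(os.environ["PTH_DEMO_RAN"])  # the .pth line has already executed
```

That is the whole trick: `pip install` drops the .pth file into site-packages, and every subsequent Python process runs the payload.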
What did it steal? Everything sensitive it could find on your machine. SSH keys and git credentials. AWS, GCP, Azure, and Kubernetes config. Docker authentication. npm, Vault, and database passwords. Even cryptocurrency wallet data.
PyPI quarantined the package within hours, which limited the damage. But if you installed either version during that window, treat your credentials as compromised. Rotate everything. Yes, that means your AWS keys, your git config, everything.
This is a good moment to audit your dependency install practices. A few things that help:
Use virtual environments or containers for every project so a compromised package cannot reach your entire system.

Enable your package manager's dependency cooldown feature if it has one. npm, pip, pnpm, Yarn, Bun, and uv all now support some form of minimum release age gating. The idea is simple: wait a few days before installing a new package version, giving the community time to catch malicious code.
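As one concrete example, uv can cap resolution by upload date with its `exclude-newer` setting, a fixed date cutoff rather than a rolling cooldown. The timestamp below is illustrative:

```toml
# pyproject.toml — ignore any distribution uploaded after this timestamp
[tool.uv]
exclude-newer = "2026-02-01T00:00:00Z"
```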
Pin your dependencies. Not just your direct dependencies, but your transitive ones too.
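A lightweight way to keep yourself honest is to diff a live environment against your pins. A sketch using only the standard library; the package names and versions in `pins` are hypothetical examples:

```python
# Audit installed package versions against a pinned set (direct + transitive).
from importlib import metadata

pins = {"litellm": "1.81.0", "httpx": "0.27.2"}  # hypothetical pinned versions

mismatches = []
for name, wanted in pins.items():
    try:
        have = metadata.version(name)
    except metadata.PackageNotFoundError:
        continue  # not installed in this environment
    if have != wanted:
        mismatches.append((name, have, wanted))

print(mismatches or "all pinned packages match")
```

In practice you would generate `pins` from a lockfile produced by your package manager rather than maintaining it by hand.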
Subscribe to security mailing lists for the packages you depend on heavily. LiteLLM's GitHub issue had full technical details within hours of discovery.
llama.cpp Officially Joins Hugging Face
The biggest local AI story of the week is one that did not involve any drama at all. Georgi Gerganov and the llama.cpp team are joining Hugging Face.
llama.cpp is the engine that made local AI accessible to anyone with a half-decent laptop. It is the library that sits underneath Ollama, LM Studio, and countless custom inference setups. It handles model loading, quantization, and efficient inference on CPU and GPU without requiring a cluster.
Joining Hugging Face does not mean llama.cpp is changing. The team will maintain full technical autonomy. The project stays 100 percent open source and community driven. What changes is the long-term stability. Georgi and his team get resources, infrastructure, and institutional backing without surrendering control of their project.
The practical implication is more interesting than the headline. The blog post mentions working toward making it nearly one-click to run models from Hugging Face using llama.cpp. Today, taking a model from the Hugging Face hub and running it locally with llama.cpp involves a few steps that are not always obvious to newcomers. That friction is about to disappear.
This also signals that Hugging Face sees local inference as a core part of its long-term strategy, not a side experiment. llama.cpp is the de facto standard for local inference. Transformers is the de facto standard for model definitions. Having both under one roof makes Hugging Face a more complete platform for the full AI development lifecycle.
If you build anything with local AI, your stack just got a little more solid.
Massive MoE Models on Consumer Hardware: The Numbers Are Stunning
Here is a fact that would have sounded absurd two years ago. Qwen3.5-397B, a model with 397 billion parameters, ran in 48 gigabytes of RAM last week. The same model was demonstrated running on an iPhone. Not a server iPhone. The regular one in your pocket.
The secret is Mixture of Experts, MoE for short. Instead of activating every parameter for every token, MoE models only use a subset. A model with 397 billion total parameters might only activate 37 billion at any given moment. A small router network decides which experts handle each token, and the rest of the weights sit idle for that step.
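The routing step is easier to see in code. Here is a toy top-k MoE layer in NumPy; the dimensions, expert count, and softmax-over-selected-experts gating are illustrative, not any specific model's architecture:

```python
# Toy Mixture-of-Experts layer with top-k routing (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2          # hidden size, total experts, active per token

W_gate = rng.normal(size=(d, n_experts))             # router weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_layer(x):
    scores = x @ W_gate                              # one score per expert
    top = np.argsort(scores)[-k:]                    # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                                     # softmax over selected experts
    # Only k of the n_experts weight matrices are read for this token.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

out = moe_layer(rng.normal(size=d))
print(out.shape)  # (16,)
```

The memory win falls out of the last comment: per token, only `k` expert matrices are touched, so the compute-active working set scales with `k`, not `n_experts`.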
This is why Kimi K2.5, a 1 trillion parameter model with 32 billion active parameters, can run on a 96 gigabyte M2 Max MacBook Pro at usable speeds. The memory requirement is not the total parameter count. It is the active parameter count plus model overhead.
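The back-of-envelope math is worth doing once. Assuming 4-bit quantization (an assumption; quantization levels vary), the weight data touched per token for a 32-billion-active-parameter model is:

```python
def active_weight_gb(active_params_billion, bits_per_weight):
    """GB of weight data touched per token, ignoring KV cache and overhead."""
    return active_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Kimi K2.5-style configuration: 32B active parameters at 4 bits per weight
print(active_weight_gb(32, 4))  # 16.0 GB — far below the 1T total parameter count
```

The total weights still have to live somewhere, which is where the weight streaming mentioned below comes in, but the hot working set is what has to fit in fast memory.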
The performance numbers are real. GPT-OSS-20B, a 21 billion parameter model with 4 active experts out of 32, hits around 115 tokens per second on an M3 Ultra Mac. That is the kind of speed you would expect from a 3.6 billion parameter dense model, not a 21 billion one.
Why does this matter for developers? Inference is no longer automatically a cloud-only operation. Small teams and indie developers can run capable models on their own hardware without burning API credits. Fine-tuning and experimentation that used to require a GPU cluster can happen on a MacBook.
The tools are still rough around the edges. Streaming MoE weights from storage is technically complex. But the trajectory is clear. The gap between local and cloud inference capability is closing faster than most people realize.
Sources: Simon Willison, GitHub BerriAI/litellm#24512, Hugging Face Blog, Hugging Face Blog MoE, Hacker News