Near-Instant Dependency Restoration in GitHub Actions with SquashFS and FUSE

If you've ever watched a CI pipeline spend 30–60 seconds just extracting node_modules from a cache archive, you know the frustration. That's dead time on every single job, every single PR - and it adds up fast when your team runs hundreds of jobs per day.
In this post, I'll walk you through the solution I designed and implemented to cut dependency restoration from ~60 seconds down to ~3 seconds on self-hosted GitHub Actions runners. The core idea: stop extracting archives and start mounting them as read-only filesystem images, using SquashFS, FUSE, and a shared EFS volume on AWS EKS.
The problem
Our monorepo has around 772,000 files and 91,000 directories inside node_modules alone. The traditional caching approach - compress everything into a tar archive, store it somewhere shared, copy it back, extract it - had become a serious bottleneck.
Profiling the extraction step revealed that the time wasn't spent on decompression. It was spent on I/O and kernel overhead: creating hundreds of thousands of files and directories, setting permissions, writing metadata. Even on fast NVMe storage, the kernel's filesystem operations (mkdir, create, chmod) for 772K files consumed ~29 seconds of sys time that no amount of faster disk could eliminate.
This realization is what led us away from extraction entirely. If the bottleneck is creating files, then the answer is to not create them at all - mount a pre-built filesystem image and read directly from it.
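The syscall overhead is easy to reproduce in miniature. Here's an illustrative micro-benchmark (scaled way down from the real 772K files) showing that creating many small files is slow even with zero decompression involved - all the time goes to mkdir/create/write syscalls:

```shell
#!/usr/bin/env bash
# Illustrative micro-benchmark: create many small files and time it.
# No decompression happens here - every millisecond is kernel filesystem
# metadata work, which is exactly where tar extraction spent its time.
set -euo pipefail

dir=$(mktemp -d)
n=2000  # scaled down from the real 772K for a quick demo

start=$(date +%s%N)
for i in $(seq "$n"); do
  mkdir -p "$dir/pkg$((i % 50))"
  printf 'x' > "$dir/pkg$((i % 50))/file$i"
done
end=$(date +%s%N)

count=$(find "$dir" -type f | wc -l)
echo "created $count files in $(( (end - start) / 1000000 )) ms"
rm -rf "$dir"
```

Extrapolate that per-file cost to three-quarters of a million files and the ~29s of sys time stops being surprising.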
The journey to the current solution
Getting here wasn't straightforward. The solution went through six iterations, and I think the failed attempts are just as instructive as the final result.
Attempt 1: tar + lz4 on EFS
The original approach. Dependencies were archived with tar -I lz4 and stored on an EFS volume. On restore, the archive was copied to local disk and extracted. Extraction took ~60 seconds. Profiling pointed to I/O as the bottleneck - writing all those files to EBS storage.
Lesson: Even with lz4 (one of the fastest compression algorithms), the sheer number of files left us I/O-bound. We needed faster disk.
Attempt 2: NVMe instance store
We tried m5d.8xlarge instances with 2x600 GB NVMe SSDs in RAID-0 (~1.1 TB) to eliminate the storage bottleneck. Extraction dropped from ~60s to ~30s - a meaningful improvement, but the remaining time was now syscall-bound. The kernel was spending 29 seconds just on filesystem metadata operations. Faster storage couldn't help anymore.
Lesson: Even the fastest NVMe SSDs only halved the time. The remaining 29s was pure kernel syscall overhead - no disk in the world could fix that. We needed to stop creating files altogether.
Attempt 3: erofs + kernel mounts
We mounted erofs images via mount -t erofs -o loop with kernel overlayfs on top. This achieved the ~5s restore target, but introduced two critical problems. First, loop device leaks: kernel loop devices persisted on the host after pod termination, and when runner pods crashed or were evicted, stale loop devices accumulated (up to 18 per node), eventually exhausting the pool. Cleaning up required terminating all EC2 instances in the node group. Second, Bottlerocket's SELinux policy blocked the kernel mount syscalls even with CAP_SYS_ADMIN, forcing full privileged: true on the runner container.
Lesson: Mounting worked - 5s restores. But kernel loop devices leaked on pod crashes, accumulating until the entire node group had to be recycled. We needed userspace mounts.
Attempt 4: erofs + FUSE
The logical next step was userspace mounting with erofsfuse, but Ubuntu 22.04's erofs-utils package (v1.4) doesn't include it - that binary was added in v1.5+. Building from source introduced too much complexity for a CI optimization.
Lesson: erofs had no packaged FUSE support on our platform. squashfs did - mature, widely available, zero build-from-source hassle.
Attempt 5: squashfs + FUSE with partial cache hits
squashfs + FUSE worked great. We built it out with three restore paths - exact hit, partial hit, and full miss - to cover every scenario. The partial hit path did incremental caching: match on a restore-key prefix, run npm install on top, diff and re-squash. It worked, but added significant complexity for marginal benefit.
Lesson: The core approach (squashfs + FUSE) was right, but the partial-hit machinery added complexity out of proportion to its benefit. We needed to simplify.
Attempt 6: squashfs + FUSE, exact-or-miss (current)
Simplified to just two paths: exact cache hit or full miss. No partial hits, no incremental saves. The simplicity is the feature - lockfile changes are infrequent enough that the occasional full npm ci on a miss is a perfectly acceptable trade-off for dropping all the incremental overlay machinery.
Lesson: Dropping partial hits and incremental saves cut the codebase in half with no measurable impact on CI times. Simplicity won.
How it works
The core concept is simple: instead of extracting 772K files from an archive, mount a compressed filesystem image and layer a writable overlay on top. Reads come from the cached image, writes go to ephemeral local disk. The workspace looks and behaves exactly as if you ran npm ci - tools like Nx, TypeScript, and Vitest see a normal node_modules directory.
Only the cached directories (node_modules, .venv) are overlaid. Everything else - .git, source files, config files - stays on the native filesystem untouched.
Cache hit: mount and go (~3s)
When the cache key matches, the restore action:
- Copies the .squashfs image from EFS to local disk
- Mounts it read-only via squashfuse (FUSE - no kernel loop devices)
- Layers a writable overlay via fuse-overlayfs at each dependency directory's workspace path
That's it. The workspace is ready for builds and tests.
Cache miss: install and save
When there's no matching cache, the action falls back to a fresh npm ci. After a successful install, it saves a new squashfs image:
- Finds all node_modules directories in the workspace
- Creates a staging directory with sudo mount --bind for each directory, preserving workspace-relative paths (passing multiple source directories to mksquashfs directly would cause it to flatten and rename duplicates like node_modules_1, node_modules_2)
- Compresses with mksquashfs using lz4 and all available processors (~12s for 764K files)
- Uploads atomically to EFS with a run-unique .tmp.$GITHUB_RUN_ID suffix, then renames to prevent collisions from concurrent runners
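The atomic-upload step is worth spelling out. Here's a minimal sketch of how it might look - save_image and the paths are illustrative, not the action's actual code:

```shell
#!/usr/bin/env bash
# Sketch of the atomic publish step. The image is uploaded under a
# run-unique temporary name first, then renamed into place. rename()
# within the same filesystem is atomic, so a concurrent reader never
# observes a half-written image and the last writer simply wins.
set -euo pipefail

save_image() {
  local image=$1 efs_dir=$2 run_id=$3
  local dest="$efs_dir/$(basename "$image")"
  cp "$image" "$dest.tmp.$run_id"     # slow copy lands under a temp name
  mv -f "$dest.tmp.$run_id" "$dest"   # atomic rename into final place
}
```

This is the standard write-then-rename pattern; the run ID in the temp suffix is what keeps two concurrent savers from clobbering each other's in-flight copies.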
Architecture
Here's what the storage layout looks like on a runner pod:
$GITHUB_WORKSPACE (native filesystem - unchanged)
├── .git/ ← native (no FUSE)
├── src/, apps/, packages/ ← native (no FUSE)
│
├── node_modules/ ← fuse-overlayfs mount
│ lower: squashfs image (cached, read-only)
│ upper: /local-cache/upper/... (ephemeral writes)
│
├── apps/app/node_modules/ ← fuse-overlayfs mount
│ lower: squashfs image (cached, read-only)
│ upper: /local-cache/upper/... (ephemeral writes)
│
└── packages/shared/node_modules/ ← fuse-overlayfs mount
...
The backing storage has two layers:
- /local-cache - an emptyDir volume on the pod. Holds the local .squashfs copy, squashfuse mount points, and overlay upper/work directories. Needs at least 10 GB capacity.
- EFS (/cache) - a shared ReadWriteMany PVC accessible from all runners. Stores the canonical squashfs images with 7-day retention.
Cache key strategy
Cache keys are computed from lockfile content only - no branch name. This means identical dependencies share the same cache across all branches.
For Node.js, the key format is node-<NODE_VERSION>-<HASH>, where the hash is computed from all package-lock.json files with .version and .packages[].version fields stripped out. This makes the cache resilient to CD version bumps that only change workspace package versions without affecting actual dependencies. A real dependency add, remove, or upgrade still changes the hash and triggers a miss.
For Python, the key is python-venvs-<HASH> computed from all poetry.lock files.
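As a rough sketch, the Node.js key could be computed like this. Note the hedge: stripping every line containing "version" is a crude stand-in for the real field-level filtering of .version and .packages[].version, and the function name is illustrative. It still behaves as described because dependency changes also touch resolved/integrity lines, which survive the filter:

```shell
#!/usr/bin/env bash
# Sketch: hash all package-lock.json files with version-only noise removed,
# so CD bumps to workspace package versions don't invalidate the cache.
set -euo pipefail

compute_node_key() {
  local root=$1 node_version=$2
  local hash
  hash=$(find "$root" -name package-lock.json -not -path '*/node_modules/*' \
           -print0 | sort -z | xargs -0 cat \
         | grep -v '"version":' \
         | sha256sum | cut -c1-16)
  echo "node-$node_version-$hash"
}
```

Two runs with identical dependencies but different workspace package versions produce the same key; adding, removing, or upgrading a real dependency changes the surviving lockfile lines and produces a different key.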
Why FUSE instead of kernel mounts
You might wonder why we use FUSE (squashfuse + fuse-overlayfs) rather than kernel mounts (mount -t squashfs + mount -t overlay), given that kernel mounts are generally faster. The answer comes down to operational safety in Kubernetes. FUSE mounts are process-owned - when a pod terminates (even on a crash), they're cleaned up automatically. Kernel loop devices, on the other hand, persist on the host. As we discovered with erofs (Attempt 3), crashed or evicted pods left stale loop devices that accumulated across nodes, eventually exhausting the pool and requiring full node group recycling. FUSE also avoids the kernel loop device pool entirely - squashfuse reads the image file directly. The privilege requirements are simpler too: FUSE only needs /dev/fuse, not full kernel mount capabilities.
Making FUSE work with GitHub Actions
This was the trickiest part. FUSE mounts behave differently from regular mounts, and the GitHub Actions runner has its own opinions about background processes. There are three specific quirks you need to work around.
1. Orphan process cleanup
The GitHub Actions runner kills background processes between workflow steps. It tracks them using the RUNNER_TRACKING_ID environment variable. Since squashfuse and fuse-overlayfs are daemon processes that run in the background, the runner will kill them between steps - unmounting your dependencies mid-workflow.
The fix: clear RUNNER_TRACKING_ID before launching the FUSE daemons.
RUNNER_TRACKING_ID="" squashfuse -o allow_other /local-cache/image.squashfs /local-cache/mnt
This prevents the runner from tracking these processes, so they survive across steps.
2. Process-scoped FUSE mounts
By default, FUSE mounts are only visible to the process that created them. Each GitHub Actions step runs in a separate shell process, so subsequent steps wouldn't see the mounts.
The fix: use the allow_other mount option, and make sure /etc/fuse.conf contains user_allow_other.
# In the runner image Dockerfile or setup
echo "user_allow_other" >> /etc/fuse.conf
# When mounting
RUNNER_TRACKING_ID="" squashfuse -o allow_other image.squashfs /mountpoint
RUNNER_TRACKING_ID="" fuse-overlayfs \
-o lowerdir=/local-cache/mnt/path,upperdir=/local-cache/upper/path,workdir=/local-cache/work/path \
-o allow_other \
$GITHUB_WORKSPACE/node_modules
3. Docker-in-Docker visibility
If your runners use a Docker-in-Docker sidecar (common for running service containers like Postgres), FUSE mounts created in the runner container are invisible to the dind container. This sounds like it would break Docker bind mounts (e.g., mounting init scripts into a Postgres container), but in practice it's not a problem - Docker bind mounts resolve through the shared work emptyDir filesystem, not through FUSE mount points. We originally added mount propagation (Bidirectional/HostToContainer) to solve this, but removed it after confirming all integration tests pass without it. No special configuration needed.
Infrastructure requirements
To implement this on your own EKS cluster, you need:
Runner pod configuration:
- /dev/fuse access - required for squashfuse and fuse-overlayfs. The simplest approach is privileged: true on the runner container, but this can be narrowed to just /dev/fuse device access.
- /local-cache emptyDir volume - local scratch space for squashfs copies, mount points, and overlay layers. At least 10 GB.
- /cache PVC mount - EFS with ReadWriteMany access mode, shared across all runners.
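In pod-spec form, the volume wiring might look roughly like this - a hedged sketch, where the volume names and the PVC claim name are placeholders for your own setup:

```yaml
# Sketch of the runner pod volumes (names are placeholders)
spec:
  containers:
    - name: runner
      securityContext:
        privileged: true        # simplest; can be narrowed to /dev/fuse only
      volumeMounts:
        - name: local-cache
          mountPath: /local-cache
        - name: efs-cache
          mountPath: /cache
  volumes:
    - name: local-cache
      emptyDir:
        sizeLimit: 10Gi
    - name: efs-cache
      persistentVolumeClaim:
        claimName: runner-cache-efs   # ReadWriteMany, EFS-backed
```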
Installed tools:
- squashfuse - FUSE-based squashfs mounter
- fuse-overlayfs - FUSE-based overlay filesystem
- squashfs-tools - provides mksquashfs for creating images
EFS configuration:
- Accessible from all runner nodes
- Sufficient throughput for concurrent reads (6+ jobs reading simultaneously is common)
- The copy step retries up to 3 times with 2-second backoff to handle NFS stale file handles under concurrent access
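A minimal version of that retry wrapper might look like this (the function name and defaults are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: copy a squashfs image off EFS with retries, to ride out
# transient NFS "stale file handle" errors under concurrent access.
set -euo pipefail

copy_with_retry() {
  local src=$1 dst=$2 attempts=${3:-3} delay=${4:-2}
  local i
  for ((i = 1; i <= attempts; i++)); do
    if cp "$src" "$dst"; then
      return 0
    fi
    echo "copy attempt $i/$attempts failed, retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}
```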
Cache lifecycle and cleanup
Cache images on EFS are cleaned up automatically. A daily scheduled workflow runs at 03:00 UTC and deletes any image whose mtime is older than 7 days. The restore action touches the EFS source file after every successful mount, so mtime reflects last successful use rather than creation time. Corrupt images that fail to mount are never touched and expire naturally.
For manual intervention, the same workflow accepts a custom age_days parameter (default: 7, 0 = delete all) and a dry_run flag to preview what would be deleted.
No per-branch cleanup on PR close is needed - since cache keys are lockfile-based and not branch-based, orphaned caches (from branches with unique lockfile hashes that have been merged or closed) expire through the daily purge.
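The purge itself can be as small as a find invocation. Here's a sketch mirroring the workflow's age_days and dry_run inputs (the function name and paths are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the daily purge: delete images whose mtime is older than
# age_days. age_days=0 means "delete everything", matching the workflow's
# manual-intervention convention; dry_run=true only lists candidates.
set -euo pipefail

purge_cache() {
  local cache_dir=$1 age_days=${2:-7} dry_run=${3:-false}
  local args=(-type f -name '*.squashfs')
  if [ "$age_days" -gt 0 ]; then
    args+=(-mtime "+$age_days")
  fi
  if [ "$dry_run" = "true" ]; then
    find "$cache_dir" "${args[@]}" -print
  else
    find "$cache_dir" "${args[@]}" -print -delete
  fi
}
```

Because the restore action keeps mtime fresh on every hit, this single find is the entire lifecycle policy - no bookkeeping database, no per-branch tracking.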
File layout
For reference, here's the full file layout:
.github/
├── actions/
│ ├── local-cache-restore/action.yml # Copy squashfs from EFS + mount via squashfuse
│ ├── local-cache-save/action.yml # Bind mount staging + mksquashfs + upload to EFS
│ └── setup-workspace/action.yml # Orchestrate overlays + dependency install
└── workflows/
└── purge-cache.yml # Daily + manual cache cleanup
EFS (/cache):
├── node_modules_cache/
│ └── node-24.11.1-<hash>.squashfs # ~1.6 GB compressed
└── python_venv_cache/
└── python-venvs-<hash>.squashfs # ~1.2 GB compressed
Runner local (/local-cache):
├── node_modules_cache.squashfs # Local copy from EFS
├── mnt-node_modules_cache/ # squashfuse mount point (read-only)
├── upper/ # Per-directory overlay upper layers
│ ├── node_modules_cache/
│ │ ├── node_modules/
│ │ ├── apps/app/node_modules/
│ │ └── ...
│ └── python_venv_cache/
│ └── services/api/.venv/
├── work/ # Overlay work directories
└── stage-*/ # Temporary bind mount staging (during save)
Troubleshooting
If you decide to implement this, here are the issues you're most likely to run into.
"Failed to mount squashfs via FUSE" - Check that /dev/fuse exists in the container with ls -la /dev/fuse. If it's missing, the runner pod may not have privileged: true or a /dev/fuse device volume. Also verify that squashfuse is installed.
"Failed to copy cache file" / "Stale file handle" - An EFS NFS handle expired under concurrent access. The retry mechanism (3 attempts, 2s backoff) usually resolves this. If it's persistent, check your EFS throughput limits or burst credit balance.
"Module not found" errors after cache hit - The squashfs image may have stale content. Purge the cache and let it rebuild on the next run.
"Could not find Nx modules" - The FUSE mount isn't visible to subsequent steps. Verify that RUNNER_TRACKING_ID="" is set before launching the FUSE daemons, allow_other is in the mount options, and user_allow_other is in /etc/fuse.conf.
Cache miss on every run - Check the cache key computation. If your lockfile changes on every run (for example, Verdaccio URL rewrites not being reverted), the hash will never match. Make sure to restore the original lockfile content after install.
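One way to guard against this is to snapshot the lockfile before the install and restore it afterwards. A sketch, where the helper name is illustrative and the rewrite step stands in for whatever mutates your lockfile:

```shell
#!/usr/bin/env bash
# Sketch: preserve the original lockfile across an install that may
# rewrite it (e.g. registry URL rewrites), so the next run's cache key
# is computed from the same bytes and still matches.
set -euo pipefail

with_lockfile_restore() {
  local lockfile=$1; shift
  cp "$lockfile" "$lockfile.orig"
  "$@"                          # the install command, e.g. npm ci
  mv "$lockfile.orig" "$lockfile"
}
```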
Docker bind mounts fail - Docker bind mounts resolve through the shared work emptyDir, not through FUSE mounts. If Docker containers can't see files that should be there, check that the work volume is correctly shared between the runner and dind containers in the Helm values.
Results
On cache hit, dependency restoration now takes ~3 seconds - down from 60 seconds with tar extraction, or ~30 seconds even on NVMe. For a team running hundreds of CI jobs daily, that's a significant amount of developer wait time eliminated - and the solution is completely transparent to the workflows that consume it.
The simplification to exact-or-miss (dropping partial hits) was the right call. The added complexity of incremental saves bought very little in practice, since lockfile changes are relatively infrequent and a full cache miss only adds the cost of one npm ci run.
If your self-hosted runners are spending tens of seconds extracting cached dependencies, consider whether mounting might be a better fit. The upfront investment in setting up FUSE and EFS pays for itself quickly at scale.




