How to track large files in GitHub / Bitbucket? Git LFS to the rescue
Git repositories are great while they stay small and hold only a modest number of files. But what do you do when your repository grows very large, say 5 GB or 10 GB, or when it comes to contain a great many folders with a great many files inside them?
Of course, you can package the content as an artifact and push it to a binary repository such as JFrog or Nexus. But you may prefer to manage it through Git, since that is close to free and considerably cheaper than those two solutions.
GitHub and Bitbucket cap how large a repository can be (2 GB at the time of writing). So how can you check large files into GitHub / Bitbucket?
There is a cheaper way to do this. Here comes Git LFS (Large File Storage).
So you must be wondering, what is Git LFS?
Git LFS is an extension to Git that lets you check in large files such as videos, datasets, graphics, ISOs and large binaries. It uploads the contents of those files to a remote LFS server and keeps small pointer files in the repository in place of the actual files or BLOBs (Binary Large Objects). So when you commit, what is written to the repository is the pointer file, not the large file itself.
In short, Git LFS lets you store large blobs or files alongside Git while saving space in the repository itself, which makes it a common choice for pushing large files to GitHub / Bitbucket.
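To make the pointer idea concrete, here is a minimal sketch that prints pointer text in the layout used by the Git LFS pointer format (a version line, a SHA-256 object id, and the size in bytes). The helper name `lfs_pointer_for` is hypothetical, and the sketch assumes the `sha256sum` utility is available; Git LFS itself generates these pointers for you.

```shell
# Hypothetical helper: print the pointer text that would stand in for a large file.
lfs_pointer_for() {
  oid=$(sha256sum "$1" | cut -d ' ' -f 1)   # SHA-256 of the file contents
  size=$(wc -c < "$1" | tr -d ' ')          # file size in bytes
  printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}

# Usage with a small stand-in file (a real candidate would be a video, dataset or ISO)
sample=$(mktemp)
printf 'hello\n' > "$sample"
lfs_pointer_for "$sample"
```

The pointer is only a few lines of text no matter how large the real file is, which is why the repository stays small.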
It is super easy to get started.
First, install the Git LFS client and enable it for your user account.
Then check that Git LFS is installed by asking it for its version.
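The setup and check described above might look like the following. This is a sketch that assumes the `git-lfs` client has already been installed on the machine (for example from your package manager); the exact commands the author used are not shown in the text.

```shell
# Enable Git LFS for your user account (adds the LFS filter settings to your global Git config).
git lfs install

# Print the client version to confirm that Git can find Git LFS.
git lfs version
```

If the second command reports that `lfs` is not a git command, the client is not installed or not on your PATH.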
Now you are ready to start tracking large files with Git LFS.
To track a large file type such as .iso with Git LFS, add a tracking pattern for that extension.
Git LFS will then handle every file with the .iso extension, and when you push, the GitHub repository stores a pointer to each file instead of the file itself. This saves a lot of space in your GitHub repository.
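The tracking step described above can be sketched as follows. It runs in a scratch repository created with `mktemp` so that it can be tried safely; in practice you would run the `git lfs track` line inside your own repository.

```shell
# Scratch repository so the example does not touch real work.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Tell Git LFS to manage every file ending in .iso.
git lfs track "*.iso"

# The rule is stored in .gitattributes, which should be committed with the repository.
cat .gitattributes
git add .gitattributes
```

After this, adding, committing and pushing .iso files works as usual, and Git LFS swaps in the pointers behind the scenes.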
If you want to track a folder instead, give Git LFS a path pattern for that folder.
If you want Git LFS to track the folder recursively, use a recursive pattern.
Note: This only tracks folders down to a certain depth. To track everything, it is better to track by extension, as shown below.
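Folder tracking might look like the sketch below. The folder name `assets` is hypothetical, and the two patterns are an assumption about what the author ran: they follow the usual `.gitattributes` pattern rules, where `*` stays within one directory level and a trailing `/**` reaches everything underneath.

```shell
# Scratch repository so the example does not touch real work.
repo=$(mktemp -d)
cd "$repo"
git init -q

# "assets" is a hypothetical folder name.
git lfs track "assets/*"     # files directly inside assets/
git lfs track "assets/**"    # everything below assets/, at any depth

# Both rules end up in .gitattributes.
cat .gitattributes
```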
So, what if there are thousands of small files and together they push the repository size above 2 GB?
There is a solution for that as well: you can track all of those files with Git LFS.
Create a file named “.gitattributes” in the main folder.
Run a command that lists every file extension in your folder as a Git LFS tracking rule.
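The author's original command is not shown, so the following is one possible sketch built from `find`, `sed` and `sort`. The function name `list_lfs_rules` is hypothetical; it skips the `.git` directory and dotfiles such as `.gitattributes` itself.

```shell
# Hypothetical helper: print one LFS tracking rule per file extension found under a directory.
list_lfs_rules() {
  find "${1:-.}" -type f ! -path '*/.git/*' ! -name '.*' -name '*.*' \
    | sed -n 's/.*\.\([^./][^./]*\)$/\1/p' \
    | sort -u \
    | sed 's/.*/*.& filter=lfs diff=lfs merge=lfs -text/'
}

# Usage: list the rules for the current folder.
list_lfs_rules .
```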
It will give you output like this
*.xlsx filter=lfs diff=lfs merge=lfs -text
*.xml filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
Copy and paste the output into that “.gitattributes” file and, as usual, do a git add and commit.
That’s it. Now check your repository size: it is still small, yet your large files are there, each represented by a hashed value that is the pointer to the LFS server where the actual file lives.
When you clone the repo, all the contents will be downloaded to your workstation from git LFS.
Now you know how to track files using Git LFS, but how do you untrack them? As you have guessed, it's super easy: Git LFS has a matching untrack command that takes the same pattern you tracked.
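Untracking can be sketched like this, again in a scratch repository and with the `*.iso` pattern from earlier standing in for whatever you tracked.

```shell
# Scratch repository so the example does not touch real work.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Track a pattern, then stop tracking it again.
git lfs track "*.iso"
git lfs untrack "*.iso"

# The *.iso rule is no longer listed in .gitattributes.
cat .gitattributes
```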
Who can use this?
- Data scientists to store large datasets.
That’s it, I hope it was helpful.