Anton Bogomazov | June 29th, 2023
Version control is crucial for managing source code files. It allows developers to track changes, collaborate seamlessly, and revert to previous versions when necessary. The same principles apply to data science projects, but traditional version control practices are not always suitable. For instance, they often fall short with the popular Jupyter Notebook format, which stores notebooks as JSON files. Working with the raw JSON in a text editor is impractical, because a notebook contains not only text but also source code, rich media output, and metadata. As a result, users typically work with the rendered notebook through a GUI.
This poses challenges when attempting to view the differences between notebook versions. Users often encounter lengthy diff outputs, even for minor changes, because a single-line code modification can change an entire cell output spanning hundreds of lines. Consequently, the diff becomes verbose, uninformative, and hard to read, which obscures the actual change.
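To make the problem concrete, here is a minimal sketch that uses only the Python standard library. It builds two tiny notebook-like JSON documents that differ by a single number in one code cell, then counts the changed lines in a plain text diff. The helper names (make_notebook, raw_diff_size) and the 200-row example are hypothetical and chosen purely for illustration; the dictionaries only imitate the general shape of a notebook file.

```python
import difflib
import json


def make_notebook(n_rows: int, scale: int) -> dict:
    """Build a minimal notebook-like dict with one code cell.

    The cell prints n_rows lines, so its stored output grows with n_rows.
    """
    source = f"for i in range({n_rows}):\n    print(i * {scale})"
    output_lines = [f"{i * scale}\n" for i in range(n_rows)]
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": output_lines}
                ],
                "source": source.splitlines(keepends=True),
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def raw_diff_size(old: dict, new: dict) -> int:
    """Count added and removed lines in a unified diff of the serialized JSON."""
    old_text = json.dumps(old, indent=1, sort_keys=True).splitlines()
    new_text = json.dumps(new, indent=1, sort_keys=True).splitlines()
    diff = difflib.unified_diff(old_text, new_text, lineterm="")
    return sum(
        1
        for line in diff
        # Skip the "---" / "+++" file headers, keep real additions and removals.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


# Only one character of code differs between the two versions (2 -> 3),
# but most of the stored output lines change with it.
before = make_notebook(n_rows=200, scale=2)
after = make_notebook(n_rows=200, scale=3)
print("changed lines in raw JSON diff:", raw_diff_size(before, after))
```

The code change itself accounts for two of the reported lines (one removed, one added); the rest is stored output that a reviewer has to scroll past.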
In one of our projects, which relied primarily on Jupyter Notebooks, we ran into these issues and looked for a solution. Our goal was to see the differences between rendered notebooks, rather than comparing the JSON file content directly.
A naive solution is to clear the outputs before committing changes, but this only partially resolves the problem. Output changes no longer appear in the diff, because no output is stored, but the diff is still hard to read. Additionally, stored outputs are often required, and they spare users from re-executing every cell with complex computations when they simply want to review the results.
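For reference, clearing outputs can be sketched in a few lines. The version below is a standard-library-only illustration that assumes the usual notebook layout (a top-level "cells" list in which code cells carry "outputs" and "execution_count"); the function names are hypothetical, and in practice a dedicated Jupyter tool would normally do this job.

```python
import copy
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs and counters removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def clear_outputs_in_file(path: str) -> None:
    """Rewrite the notebook file at `path` without outputs (overwrites the file)."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(clear_outputs(notebook), handle, indent=1, ensure_ascii=False)
        handle.write("\n")


# Small in-memory example, so nothing on disk is touched.
sample = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [{"name": "stdout", "output_type": "stream", "text": ["42\n"]}],
            "source": ["print(6 * 7)"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
print(json.dumps(clear_outputs(sample), indent=1))
```

Even the cleaned file is still JSON with quoting, escaping, and per-cell metadata around every line of code, which is why the resulting diff remains awkward to review.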
Fortunately, the Jupyter ecosystem offers several tools for working with notebooks, and one of them, nbdime, provides the desired solution. With nbdime, it becomes possible to view the differences between different versions of Jupyter Notebook files in a human-readable format, both in the console and through a graphical interface. Importantly, nbdime does an excellent job of identifying changes not only in code cells but also in their outputs.
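As a rough sketch of how this can be scripted, the snippet below wraps nbdime's command-line tools from Python. It assumes nbdime is installed (for example via pip install nbdime) so that the nbdiff and nbdiff-web commands are on the PATH; the wrapper functions and the file names are hypothetical and exist only for this example.

```python
import shutil
import subprocess
from typing import List


def nbdiff_command(old_path: str, new_path: str, web: bool = False) -> List[str]:
    """Build the nbdime command line for comparing two notebook files.

    nbdiff prints the diff in the console; nbdiff-web opens a rendered,
    side-by-side view in the browser.
    """
    tool = "nbdiff-web" if web else "nbdiff"
    return [tool, old_path, new_path]


def show_notebook_diff(old_path: str, new_path: str, web: bool = False) -> int:
    """Run nbdime on two notebooks and return the process exit code."""
    command = nbdiff_command(old_path, new_path, web=web)
    if shutil.which(command[0]) is None:
        raise RuntimeError("nbdime does not appear to be installed")
    return subprocess.run(command, check=False).returncode


# Hypothetical file names; only the command is printed, nothing is executed.
print(" ".join(nbdiff_command("analysis_old.ipynb", "analysis_new.ipynb")))
```

To use the same diff for everyday git commands, nbdime can also register itself as the diff and merge driver for .ipynb files with nbdime config-git --enable.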
Using nbdime, we solve the problem of effectively viewing differences between commits locally. However, when it comes to code review on services such as Bitbucket, the default diff output remains uninformative.
In such cases, it is worth considering commercial VCS extensions or leveraging another Jupyter tool called nbconvert. This tool allows exporting notebook content into various formats. The best practice is to generate a plain text file containing the code alongside an HTML render of the notebook, including its outputs. This allows for separate tracking of code changes and also provides users with a convenient way to view the notebook outputs, including those from previous versions. It is also a good idea to automate this process and add a commit hook for generating Python and HTML files from the notebook. This ensures the consistency of all artifacts and enhances the developer experience by streamlining the workflow.
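One possible shape for such a hook is sketched below in Python. It is an illustration rather than the exact setup from our project: it assumes that git and jupyter nbconvert are available on the PATH, that the notebooks use a Python kernel (so the script export ends in .py), and that nbconvert writes its output next to the notebook, which is its default behavior as far as we know. All function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def export_commands(notebook: str) -> List[List[str]]:
    """Build the nbconvert invocations: a plain script plus an HTML render."""
    return [
        ["jupyter", "nbconvert", "--to", "script", notebook],
        ["jupyter", "nbconvert", "--to", "html", notebook],
    ]


def generated_files(notebook: str) -> List[str]:
    """Expected artifact paths, assuming a Python-kernel notebook."""
    path = Path(notebook)
    return [str(path.with_suffix(".py")), str(path.with_suffix(".html"))]


def staged_notebooks() -> List[str]:
    """List added, copied, or modified .ipynb files in the git index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [name for name in result.stdout.splitlines() if name.endswith(".ipynb")]


def main() -> int:
    """Export every staged notebook and stage the generated files with it."""
    for notebook in staged_notebooks():
        for command in export_commands(notebook):
            subprocess.run(command, check=True)
        subprocess.run(["git", "add", *generated_files(notebook)], check=True)
    return 0


# Preview for a hypothetical notebook; nothing is executed here.
for command in export_commands("report.ipynb"):
    print(" ".join(command))
```

To use it as a hook, the script would be saved as an executable .git/hooks/pre-commit file that ends by calling main() and exiting with its return value. Because every export runs with check=True, a failed conversion raises an error and aborts the commit, which keeps the notebook, the script, and the HTML render in step.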
In conclusion, version control is essential for managing Jupyter Notebooks effectively. While traditional version control practices may not seamlessly handle Jupyter’s JSON-based format, tools like nbdime and nbconvert offer valuable solutions. By adopting these practices, teams can maintain a comprehensive record of code modifications, facilitate thorough reviews, and ensure a smoother and more efficient collaboration process.