nf-core/bytesize

Join us for our weekly series of short talks: “nf-core/bytesize”.

Just 15 minutes + questions, we will be focussing on topics about using and developing nf-core pipelines. These will be recorded and made available at https://nf-co.re. It is our hope that these talks / videos will build an archive of training material that can complement our documentation. Got an idea for a talk? Let us know on the #bytesize Slack channel!

Bytesize: Converting Python scripts into packages for PyPI, Bioconda & Biocontainers

This week, Phil Ewels (@ewels) will show you how to take a Python script and turn it into a stand-alone command-line tool, ready for distribution via the Python Package Index (PyPI).

You can download a .zip file of the "before" and "after" code examples Phil demoed here.

This is a good thing to do for a few reasons:

  • More people can use your scripts - not just within Nextflow
    • This is useful for development, for stand-alone testing
    • It's useful for people using other workflow managers
    • It helps when users are testing a method / debugging with small sample sizes
  • It allows scripts to be released under different licenses to the pipeline itself
  • Software packaging, that is providing container images with all requirements, is handled automatically

Even if it's a small script that you think no-one will ever use outside of your pipeline, it's easy to do and you don't lose anything 🙂

Once released in PyPI, releases via Bioconda are simple (see Bytesize 40: Software packaging). Once in Bioconda, software will be available for Conda users, but also Docker + Singularity, via the BioContainers project.

Video transcription

Note: The content has been edited for reader-friendliness

0:01 Hello, everyone, and welcome to this week's bytesize talk. I'm very happy to have Phil here, who is talking today about converting Python scripts into packages for PyPI, Bioconda, and biocontainers. It's your stage, Phil.

0:17 Thank you. Hi, everybody. Thank you for joining me today. We're going to have a little bit of fun together, hopefully. Today's talk was inspired by a conversation that's come up a few times within nf-core, which is when people have got scripts within a pipeline, so typically within a bin directory, or it could be within the script or shell block of a process. Instead of bundling that script with the pipeline, we prefer to package that script - or set of scripts - as a standalone software package. There are a few different reasons why we like to do this. Firstly, it makes the package and the analysis scripts available to anyone to use, even if they're not using Nextflow and not using this pipeline, so that's for the greater good of the community. More reusability, more visibility. It can sometimes help with licensing because we're no longer bundling and modifying code under potentially a different license within the nf-core repo, so the nf-core repo can be MIT and can just call this external tool. It also helps with software packaging, as Fran mentioned. For free, we then get a Docker image, a Singularity image, a Conda package, with all of the different requirements that you might need, so you don't need to spend a lot of time thinking about setting up custom Docker images and all this stuff. You just package your own scripts as its own standalone tool and you get all of that stuff for free, so, much better. All the maintenance can sit alongside the pipeline rather than being integrated into the pipeline. It's a nice thing to do and for me, the main reason is that first one, which is that it makes the tool more usable for anyone, not necessarily tied to running within Nextflow, which I think is great because it's nice to use tools on a small scale and then to scale up to using a full size pipeline when you need it.

2:16 I've told people in the past that this is easy, which it is, if you've done it lots of times before. But I thought it's probably time to put my money where my mouth is and actually show the process and hopefully convince you, too, that it isn't so bad. Now a few things to note before I kick off, firstly, I'm going to live code this. I have run through it earlier, so I've got a finished example on my side, which you can't see, which I will copy and paste from occasionally and hopefully refer to, if everything really goes wrong, but in the words of SpaceX, excitement is guaranteed because something will blow up at some point. So join me on that. Secondly, there are many, many ways to do this. My way is not necessarily the same as what I'm going to show and there are better ways to do things and probably recommendations that you should listen to from other people that are much better than mine. My aim today is to try and show you the easiest way to go from Python scripts to something on Bioconda, and I want to try and make that beginner friendly and as bytesized as possible.

3:28 Let's start by sharing my screen up here and we will kick off. Spotlight my screen for everybody, so hopefully you can still see my face. To start off with a famous XKCD comic about Python environments, which are famously complicated packaging environments. We're going into something which is known for being difficult and varied, but that's fine. I'm going to keep it as simple as possible and you don't need to worry about all this stuff. I've got a little toy Python script here, it doesn't do very much, it just makes a plot and I wanted some input, so it takes a text file here, delete that now, called title.txt with some text in it. It reads that file in, sets it as a variable and sets the plot title to whatever it found and then it saves it. This is our starting point, I can try and run this now. If I do python analysis.py, there we go, we've got our plot and my nice plot, so it works, first step. This is where I'm assuming you're starting off, is you have a Python script which works.

4:45 We have a few objectives to take this script into a standalone Python package. Firstly we want to, as far as possible, make things optional and variable, so instead of having a fixed file name with a string like this, we want a better way to pass this information in to the tool, so we want to build the command line tool. We want to make it available ideally anywhere on the command line on the path, so make it into a proper command line tool rather than a script which you have to call using Python. We can call it "my_analysis_tool" or whatever and run that wherever. Once we've done all that stuff we want to package it up using Python packaging so that we have everything we need to push this package onto the Python package index, and we're going to focus on that. Once we've got this as a tool on PyPI, where anyone can install it, then the steps from PyPI to Conda are fairly easy. Once it's on Conda you get biocontainers for free, which is the Docker image and the Singularity image. Really our destination for today is just Python packaging, just PyPI. There's another talk, it's fairly old by now, but it's still totally valid, by Alex Peltzer on nf-core bytesize. It takes you through the Bioconda packaging steps, so you can follow on from this talk with that one. Hopefully that makes sense.

6:14 First steps first, let's try and make this into a command line tool. Now there are a bunch of different ways to do this. Probably the classic Python library to do command line parsing is called argparse, which many of you may be familiar with. Personally I've tended to use another package called "click", and more recently I am tending to use a package called "typer", which is actually based on "click". If I just use the right browser, this is the URL for "typer". Gosh, it's quite big. I'll just make my window bigger for a second, not to read anything here but just to see what the website really looks like. It's got a really good website, it explains a lot about how to use it and you can click through the tutorial here and it tells you about everything: what's happening, why it works the way it does and how to build something. We can start off with this, the simplest example, and we're going to say import typer here. Go up to the top, import typer, wrap our code in a function. I can't copy from the VS code browser apparently, so I'm going to indent all of this code. Then I'm going to copy in that last bit which was there... my other window... down at the bottom.

7:55 What's happening here? I'm importing a Python library called "typer", which is what we're using for the command line tool. I've put everything into a function which is just called main, and then at the bottom I've said if __name__ == "__main__", so this is telling Python: if this script is run directly, use "typer" to run this function. If I save that, now I can do python analysis.py and nothing new will happen, it should just work exactly the same, but I can do python analysis.py --help and you can see we're starting to get a command line tool come in here.

8:27 Next up, let's get rid of this file, we don't really care about it being in the file, that was just a convenience, so I'm going to say let's instead pass the title as a command line option. With "typer" we just do that by adding a function argument to this function and I can get rid of this bit completely. To prove it I'll delete that file as well. Let's try again, do python analysis.py --help and sure enough now we have some help text saying, hey, we are expecting a title, which is text, and we have no default. If I try and run it without any arguments it will give me a nice error message. Now if I say "hello there", it's passed that in and our plot has a different title. That is our first step complete. We have a rudimentary command line interface and we have got rid of that file and we've now got command line options, which makes it a much more usable, flexible tool, and that was not a lot of code, I think you'll agree with me. With "typer" you can do many more things. You can obviously add lots more arguments here. You can say it should be an integer or boolean and it will craft the command line for you. You can use options instead of arguments, so --whatever. You can set defaults, you can write help texts, loads of stuff like that. As your tool becomes more advanced, maybe you dig into the typer documentation a little bit and learn about how to do that, but that's beyond the scope of today's talk.
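As a rough sketch of the two steps above (not the exact code from the demo), the script could end up looking something like this. The plotting body is replaced by a print stand-in to keep it short, the --output option and the help texts are hypothetical additions, and typer's test runner is used so that the example runs in-process rather than from a shell.

```python
import typer
from typer.testing import CliRunner


def main(
    title: str = typer.Argument(..., help="Text to use as the plot title."),
    # Hypothetical extra option, only here to show an option with a default
    output: str = typer.Option("plot.png", help="File name for the saved plot."),
):
    """Make a plot with the given title."""
    # Stand-in for the plotting code in analysis.py
    typer.echo(f"Would save a plot titled '{title}' to {output}")


# In the script itself, the file would end with:
#
#     if __name__ == "__main__":
#         typer.run(main)
#
# Here the same function is wrapped in a Typer app and called with typer's
# test runner, which behaves like running the script with those arguments.
app = typer.Typer()
app.command()(main)

runner = CliRunner()
help_result = runner.invoke(app, ["--help"])        # like: python analysis.py --help
ok_result = runner.invoke(app, ["hello there"])     # like: python analysis.py "hello there"
missing_result = runner.invoke(app, [])             # no title given: usage error

print(ok_result.output)
print("exit code without a title:", missing_result.exit_code)
```

Because the function signature carries the type hints, defaults and help texts, there is no separate parser definition to keep in sync with the code.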

10:04 Next up, let's think about how to make this into an installable package and something we can run on the command line anywhere; those two things go together. If someone else comes and wants to run this package they're going to need to be able to import these same python packages, so I'm going to start off by making a new file called "requirements.txt" and I'm going to take these package names there and just pop them in there. We'll come back and use that in a minute and in the short term, if someone wanted to, they could now do pip install -r requirements.txt and that would install all the requirements for this tool. I'm also going to start moving stuff into some subdirectories and by convention I'm going to put it into a directory called "source". But it doesn't really matter, you can call it whatever you want. I'm going to call the package "my_tool" and I'm going to move that python file up into that directory there. I'm also going to create a new file called __init__.py. This is a weird looking file name and it's a special case. By doing this in python, it tells the python packaging system that this directory behaves as a python module, which is what we want to install later, and so I can add a docstring at the top saying "my_amazing_tool". I'm actually going to not put anything in here for now apart from a single variable which I put here by convention, but really you can do whatever you want. I'm going to call it __version__, again using dunder - so double underscore -, version, double underscore, and also, you know, semantic versioning, 0.0.0.1 dev. We'll come back and use this variable a bit later, but for now it doesn't do anything.
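To make the layout concrete, here is a small sketch that builds it in a temporary directory and then imports the package. It assumes the directory is named "src", the package "mytool" and the version string "0.0.1dev"; those are placeholder choices, not necessarily the exact names used in the demo.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway project directory, so nothing is written to the working directory
project = Path(tempfile.mkdtemp())

# requirements.txt: one dependency per line
(project / "requirements.txt").write_text("matplotlib\ntyper\n")

# src/mytool/__init__.py is what marks the directory as an importable package
package_dir = project / "src" / "mytool"
package_dir.mkdir(parents=True)
(package_dir / "__init__.py").write_text(
    '"""My amazing tool."""\n\n__version__ = "0.0.1dev"\n'
)
# The analysis script moves into the package directory (placeholder content here)
(package_dir / "analysis.py").write_text("# analysis code goes here\n")

# List the files before importing, so no cache files show up in the listing
layout = sorted(p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_file())
print("\n".join(layout))

# With src/ on the import path, the directory can be imported like any module
sys.path.insert(0, str(project / "src"))
mytool = importlib.import_module("mytool")
print(mytool.__version__)
```

Keeping the version in one variable inside the package means both the code and the packaging metadata can read it from the same place.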

11:57 What else? We want to make the typer example slightly more complicated. We're gonna now create a typer app like this. We're going to get rid of this bit at the bottom, because we don't actually need that anymore if we're not going to be running it as a script. We're not going to be calling that python file directly. Get rid of that. We're going to now use a python decorator called app.command() here, to tell "typer" that this is a command to be used within the command line interface. This is the normal typer setup; that first example is so simple that you almost never use it with "typer". This is what you always do, and then you can have multiple functions here decorated with command and you can have multiple sub-commands within your CLI that way, and groups of sub-commands and all kinds of things. With nf-core we have grouped sub-commands. You do nf-core modules update for example and those are separate sub-commands, so that's how you do it here. But for now, this would work in exactly the same way as the example I showed you a second ago.
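A minimal sketch of that app-plus-decorator pattern might look like the following. The second sub-command ("version") is a hypothetical addition, included only to show how several decorated functions become sub-commands; the plotting body is again a print stand-in.

```python
import typer
from typer.testing import CliRunner

# The app object is what the packaging step will later point at
app = typer.Typer(help="My amazing tool.")


@app.command()
def plot(title: str):
    """Make a plot with the given title."""
    # Stand-in for the plotting code
    typer.echo(f"Plotting with title: {title}")


@app.command()
def version():
    """Print the tool version (hypothetical second sub-command)."""
    typer.echo("0.0.1dev")


# Invoke the sub-commands in-process, as a shell would do with
# "<tool> plot 'hello there'" and "<tool> version"
runner = CliRunner()
plot_result = runner.invoke(app, ["plot", "hello there"])
version_result = runner.invoke(app, ["version"])
print(plot_result.output, version_result.output)
```

With only one decorated function, typer does not ask for the sub-command name, which is why the single-command version behaves the same as the earlier example.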

12:58 I'm going to add... because this is going to be a python package, it's really important to tell everybody about how to use it. I'm going to create a new LICENSE file. I am a fan of MIT, so I'm going to make it the MIT license and just paste in the text there that I've grabbed off the web and I'm going to make a README file, because this is going to turn up on github. We want people to know about what the tool is and how to use it, when they see the repo.

13:27 Okay, hopefully you're with me, that's all the simple stuff. Now we'll get on to a slightly more complicated bit about how to take this and make it installable. This is one of the bits where it gets very variable about how you can do it. Typically within python you can use a range of different installable python packages to do your python packaging. It's quite meta. There's a very old one called "distutils" which you shouldn't use and there's one called "setuptools" which is most common. That's what I'm going to use today. Other people like other packaging setups, one popular one being "poetry". There are quite a lot of them so if you have a preference, great, go for it. Maybe in the discussion afterwards people can suggest their favorites, but for now I'm going to stick with setuptools and I'm going to create "setup.py", which, and again this gets a bit confusing, you don't necessarily need, and "setup.cfg". I should jump in here: you don't need to remember how to do this. I don't remember how to do this. I don't think anyone really remembers how to do this. I do some browsing, type in "setuptools.pypa.io", and you can see there's quite good docs on this website for setuptools. They tell you how to do everything, it's quite easy to read and they also talk through all the different options of how to build this stuff. You can do it with what's called a "pyproject.toml" file, which is probably what I'll start doing soon when it becomes slightly more standard. There's a setup.cfg file, which is what I'm going to do now, and there's also some documentation about the old school way of doing it, which is "setup.py". For now the "setup.py" file is just for backwards compatibility.

15:09 I'm going to do exactly what it tells me to do here. I'm going to say import setuptools, setup(), save, and then I just forget about this file and never look at it again. Everything else goes into this setup.cfg file and you can work through the examples here. For now I'm going to cheat for the sake of time and copy in the one I did earlier and just walk you through what these keys are quickly. Again, I always copy this from the last project I did but you can copy it from the web very easily. "name" is important, "version" is important, because when you're updating a python package it needs to know which version number it is. And this then is using the special variable I set up here. Now if you look where it is, it's in the python module I made called "mytool" and the variable name is __version__. Here I'm saying, use an attribute. I could hard code it in this file if I wanted to, but I'm using it as an attribute and I'm using this variable, which is mytool.__version__. You could call that whatever you want or you could just hard code it in this file. "Author", "description", "keywords", "license", "license files", "long description", and a key to say it's markdown; that's just what shows on the PyPI website. "Classifiers", which are just categories; I always copy these without thinking. You can probably think a bit more about it if you want to. There is some slightly more interesting stuff down here. The minimum required version of python, which might be important for you. Where you put your source code; in this case I say look for any python modules you can find and look in the directory called source. If you call that something different you put that here, and then that's looking for __init__.py files like that. Then we say we require a bunch of other python packages here. Here I'm saying look at this file called requirements.txt. If you didn't want to have that file for whatever reason you can also just list them in this file here as well.

17:12 Finally "console scripts". This is the bit which actually makes it into a command line tool and here we say I want to call my tool myawesometool. When someone types that into the command line, what I want python to do is to find the module called "mytool", which we've created here, with the init file. I've actually got this script called "analysis" here. Again, this file name could be whatever you want. Then look for a variable called "app". Here our variable is called "app", but I could also put a function name here as well, if I wanted to. For typer I'm going to say ".app".
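The keys walked through above could be sketched as the setup.cfg below. It is held in a Python string and read back with configparser (setup.cfg is an INI-style file) purely to show how the console script entry breaks down. The author, description and Python version are placeholders, "src" is an assumed directory name, and pointing install_requires at a file is a setuptools feature that its documentation has described as beta, so check the docs for your version.

```python
import configparser

# Illustrative setup.cfg contents; values marked as placeholders are assumptions
SETUP_CFG = """
[metadata]
name = mytool
version = attr: mytool.__version__
author = Your Name
description = Make a plot with a custom title
long_description = file: README.md
long_description_content_type = text/markdown
license = MIT
license_files = LICENSE

[options]
python_requires = >=3.7
package_dir =
    = src
packages = find:
install_requires = file: requirements.txt

[options.packages.find]
where = src

[options.entry_points]
console_scripts =
    myawesometool = mytool.analysis:app
"""

config = configparser.ConfigParser()
config.read_string(SETUP_CFG)

# "command = module:attribute" is the shape of a console script entry
entry = config["options.entry_points"]["console_scripts"].strip()
command, target = (part.strip() for part in entry.split("=", 1))
module_name, attribute = target.split(":")

print(f"'{command}' runs {attribute} from {module_name}")
print("version comes from:", config["metadata"]["version"])
```

Reading the entry this way mirrors what happens at install time: a small launcher named after the command is created, and it imports the module and calls the named attribute.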

17:53 Now, python will know what to do when I install my tool and... moment of truth, let's try and install it and see what breaks. The Python Package Index uses pip and I'm going to say pip install. I could just do full stop for my current working directory and that will work, but I'm actually going to add the -e flag here, to make it editable. What that does is, instead of copying all the files over to my python installation directory, it soft links them, and that's really useful when developing locally because I can make edits to this file, hit save, and I don't have to reinstall the tool every single time. I just am always in the habit of using -e pretty much all the time. Let's see what happens... yeah, it broke. "Setup not found". That's because I got the import wrong. It should be from setuptools import setup and then setup(). I could have done setuptools.setup() instead, that should work as well. Let's try again. Great, you can see it's running through all those requirements. It's installing all the back end stuff, which is like matplotlib and "typer", and it installed! So now, what did I call it? myawesometool! If I do myawesometool --help... Hooray! It works! Look at that, we've got a command line tool! Now I can run this wherever I am on my system. I don't have to be in this working directory anymore, doesn't matter if I... let's give an example... go into a testing directory. I can do myawesometool "This is a test". There we go. Now we've got that file created in there, because that was my working directory, and sure enough, I got a nice title. Brilliant!

20:03 We have a command line tool, it installs locally, it works and it's got a nice command line interface. We're nearly there. The final thing then is to take this code and put it onto the python package index. If you start digging around on google, you will find instructions on how to do this and it will say run a whole load of command line functions. Run those, do this and that will publish it. There's a sandbox environment where you can test first and you have to sign up to PyPI, obviously, and register and create a project and everything. But my recommendation is to keep things simple and the only way I do it now is to do all of this through github actions and automate your publication of your package. That's all I'm going to show you today, because I can walk you through that quite easily and it's the same logic. If you've not used github actions before, the way it works is, you create a directory called .github - it's a hidden directory - and a subdirectory called workflows. In here I'm going to create a new file, which can be called anything "deploy-pypi.yaml".

21:16 Then I'm going to cheat and copy, because otherwise it's going to take me a while to type all this in. I'm going to walk you through it. This is a yaml file that tells github actions what to run and when to run it. We have a name up here, which can be anything, and firstly we have a trigger. This tells github to run this github action whenever this repository has a release and the event type is published, so whenever you create a new release on github and you click publish, this workflow will run, and it'll run on the default branch. Then we have the meat of it. What is it actually doing? It's running on Ubuntu. It's checking out the source code first and setting up python. Now I install the dependencies manually here. I'm not totally sure if this is actually required or not, but it was in the last github actions I did, so I thought I'd do it again. The first command is just upgrading pip itself and setting up setuptools and stuff. Then we do the pip install . command again, just to install whatever's in the current working directory. Now on github actions your tool is installed, and then we run this python command with setup.py, which is just calling setuptools and saying sdist, the source distribution, and create a bdist_wheel. We don't need to know what that means or why it's there, but those are just the files that the python package index needs. Now it's built the distribution locally and then finally we publish it.

22:40 You can see where I copied it from. We publish it to the python package index. This is a check just to make sure that, if anyone has forked your repository, it doesn't bother trying to do this, because it obviously won't work. I usually just put this in, a check on what your github repository is called, and then use this python package index action, which is a github action that someone else has written. I'm using a password, and this is a github actions secret: it's an api token that you can get from the python package index website when you're logged in. That gives github actions all the credentials it needs to be able to publish the python package for you. That's it. If everything works well, you stick all this on github, you make it all lovely, you hit release and then you will be able to watch that workflow running and it will say "workflow published".

23:33 Remember to change this version when you run it more than once, because if you try and publish the same package twice with the same version number on python package index, it will fail. As long as you bump that, then everything should work and you should end up with a package on pypi. The package will be listed under what you set as "name"; I think that's what python package index uses. You'll be able to pip install mytool from anywhere. Anyone will be able to do that and it will just work, and that's it. At that point you can pat yourself on the back, think how amazing the job you've just done is and how anyone can now use your analysis tools. Prepare yourself for an onslaught of bug reports on github and take the next step and scaffold that pypi recipe into bioconda and do all the last stuff. But like I say, that's in a different talk and I'm not going to swamp everyone by talking about that too much today. Hopefully that made sense for everybody. Shout if you have any questions, and I'd love to hear what workflows other people have, whether I made a mistake, and if you think I should do it in a different way and your way is better.

24:42 (host) Thank you so much. It's nice to see how some of the magic actually happens in the background. Do we have any questions from the audience?

(question) I've got one. Have you tried cookie cutter to automate all of this?

(answer) When I was prepping this with like five minutes to go, I was desperately trying to find a link for a really nice project which I've seen. I've spoken to the authors and I cannot remember the name of it. There's a few of them floating around but there's one definitely for bioinformatics where you can use a cookie cutter project and it scaffolds an entire python package index project for you, with all of this stuff in place. It's probably much better and quicker. I purposefully chose not to show that today, because I was thinking of going from someone who already has a script which is working through, and trying to explain what all the different stuff is doing. If you're starting from scratch I would absolutely do that and if anyone has any good links for projects or can remember the projects I'm talking about, please post them here or in slack.

(question cont.) I'll just drop the link in the chat. If someone doesn't know what we're talking about.

(answer cont.) That link is for cookie cutter itself, right, which is just a generic templating tool. There are cookie cutter projects which people have created, like template repositories, specifically for python, if that makes sense.

26:09 (question) We do have another question in the chat. Someone is asking why not pyproject.toml?

(answer) This is something else I was debating at the start. There's a bit of history here. When I started creating my first python projects you always used that setup.py file, and you still can. It's a bit like how Nextflow config files are just a groovy script, where you can do whatever you like. setup.py is the same. It's just a python script, where you can do whatever you like. Which is wonderful and horrifying! Slowly over the last... the python community moves slowly... so for the last many years, there's been a move away from that way of doing things into more standardized file types and there are two which are being used: there's the setup.cfg file, which is exactly the same thing but in a structured file format, and the other one is pyproject.toml, which is the newer and better way of doing things. pyproject.toml is nice because it's also a standard for many other python tools with configs. If you want to use black to lint your code, which you should, because black is amazing, you'll put your settings in pyproject.toml. If you use, I don't know, mypy for type linting, or any of these flake8 tools, or whatever other linting tools, they all stick their settings in pyproject.toml, which is great because you have one config file for everything to do with your python project. That is much nicer and you can also do all of your setuptools python stuff in there. There are a couple of things which I think are missing. Correct me if I'm wrong, I don't think you can point it to a requirements.txt file for all requirements. It's quite useful having that file sometimes, maybe it doesn't matter... I think the setuptools website says it's in beta and it might change, so I thought I'd play it safe today and go for setup.cfg, which is newish, but fairly safe. But yeah, pyproject.toml is, if you can make it work for you, probably a nicer way to do it.

28:13 (host) We have some more comments. There was a link posted to Morris' cookie cutter package which has not been tried out, at least not by the person who posted it. It says ironically flake8 can't actually work with settings from pyproject.toml or at least couldn't a couple of months ago.

(speaker) Cookie cutter, this might look familiar to anyone who's used the nf-core template. We used to use cookie cutter for nf-core back in the early days and still use the underlying framework, which is called Jinja. That's where these double squiggly brackets come from; it's a templating system, as you can see. Here you've got all these different settings there, with license options and a name and stuff, and then these will go into all these double bracket things. The idea is, you do cookie cutter run or cookie cutter build, I can't remember what the command is now. Then you give it this github url and it will ask you a few questions, which will just replace these defaults here. Then it will generate this package here, but with all the template placeholders filled in.

29:26 (host) Great! Do we have any more questions? It doesn't seem so. Thank you very much for this great talk. Before we wrap this up entirely I also have something to mention. Next week's bytesize talk is going to be one hour late. I will also post this again in the bytesize channel. Very interestingly, there will be a talk from people that were part of the mentorship program. The deadline for the mentorship program just got extended, so it's actually for anyone who is still questioning if they should join or not. This is your chance to listen to people who have been part of it as they give some impressions. With this I would like to thank Phil again, and I would like to thank everyone who listened. Of course, as usual, I would like to thank the Chan Zuckerberg Initiative for funding our talks. Have a great week everyone.