Intro

Hello, World! In my first real post I wanted to share some thoughts about resources for getting started in Python or data science. This has been nearly a 3 year process for me and could have been done much more efficiently; in this post I'll point out a few things I would do differently if I were restarting today. Hopefully, some of these tips can be helpful to others who are learning at their own pace. I will try to keep this post updated with new resources as I find them. Keep in mind that I'm still actively learning about many tools, so my insights will no doubt be incomplete. If you've been through this process and have recommendations for things to improve or change, please share them in the comments!

Python Install and Package Management

TL;DR: Read Ted Petrou's "Anaconda is bloated."

First things first - 99% of what I write here will be about Python. As most Python beginners do, I started off by downloading Anaconda Navigator to keep things simple. The first mistake I made - against the advice of many experienced Python users - was neglecting to use Python environments. The process of installing, uninstalling, and reinstalling packages eventually led to a messy environment. It was time to restart. The simplest fix seemed to be a full reinstall of Python, which in reality was anything but simple. Instead of learning new Python packages, I was spinning my wheels. Eventually, I found Ted Petrou's "Anaconda is bloated", which gives a fantastic overview of how to get a simple, effective data science set-up running in Python. Pairing this with a program like Visual Studio Code can get you coding quickly, with the ability to start over fresh when/if the time comes. Next let's talk a bit about interfaces.
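Before moving on, here is a minimal sketch of why environments make "starting over" cheap. It uses Python's standard-library venv module rather than the conda-based set-up described in Ted Petrou's article, and the folder names are hypothetical, so treat it as an illustration of the idea and not as the recommended workflow.

```python
import shutil
import tempfile
import venv
from pathlib import Path

# Hypothetical project folder (a temp directory here, so the example
# cleans up after itself and touches nothing else on your machine).
project_dir = Path(tempfile.mkdtemp(prefix="ds_project_"))
env_dir = project_dir / ".venv"

# Create an isolated environment for this one project.
# with_pip=False keeps the example fast and offline.
venv.create(env_dir, with_pip=False)
env_created = (env_dir / "pyvenv.cfg").exists()
print("Environment created:", env_created)

# "Starting over" is just deleting the folder; the base Python
# install is never modified.
shutil.rmtree(project_dir)
env_removed = not env_dir.exists()
print("Environment removed:", env_removed)
```

For real work you would most likely pass with_pip=True so that packages can be installed into the environment, or create the environment with conda as the article describes.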

User Interface (IDE) Selection

TL;DR: If starting over today I would install Jupyter Notebooks, but use Visual Studio Code as my daily interface. VS Code is a great text editor, but also gives you the power of IPython Notebooks and the freedom to experiment with other innovative interfaces like Streamlit.

Of course, you can always write Python code in a simple text editor and run your scripts from the terminal, but modern IDEs offer a lot of built-in efficiencies. When I started down the path of learning Python, the idea of not having an official IDE, like MATLAB or RStudio, made me a little squeamish. I jumped straight into Jupyter Notebooks for the simplicity and was ecstatic when I learned about JupyterLab, which takes the interactivity of Jupyter Notebooks a step closer to the full IDE experience. The IPython Notebook, which is the basis for Jupyter's Python interface, is widely used across data science platforms like Kaggle and Coursera, which makes the transition from your machine to other platforms more seamless. The only thing I didn't like about the Jupyter environment was that I was locked into that interface.

What pushed me to try the oft-recommended VS Code was the introduction of a new interactive interface called Streamlit (more on Streamlit in a bit). In order to try Streamlit I needed a text editor that was separate from Jupyter Notebooks. What I found in VS Code was a text editor that runs IPython Notebooks, has a built-in terminal for package/environment management (or git), and of course efficiently edits text files. I'm still learning many of the creative editing shortcuts and I'm sure I am not using the extensions to their full capability, but VS Code has become my primary workspace.

Lastly, I'll touch on Streamlit, a relatively new interface with a completely different take on interactive code. Streamlit runs the full code of your Python script on every pass, unlike IPython Notebooks, which execute a single cell at a time. To keep that practical, Streamlit allows compute-intensive functions to be cached so that they don't need to be re-run over and over. The shining feature that made me fall in love with Streamlit is its ability to add interactive features like buttons, sliders, drop-downs, or text input with a single line of code. These features make data exploration much faster in Streamlit and also provide a framework to build simple applets for people looking to deploy ML algorithms quickly. This video from the co-founder and CEO gives a great introduction to the interface.
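To make the "single line of code" point concrete, here is a small sketch of a Streamlit script. It assumes Streamlit is installed and that the file is saved under a hypothetical name such as app.py and launched with streamlit run app.py. The damped_wave function and the render helper are made up for illustration, and caching is left out to keep it short.

```python
import importlib.util
import math


def damped_wave(n_points, decay):
    """Return n_points samples of a sine wave whose amplitude decays with x."""
    xs = [i * 0.1 for i in range(n_points)]
    return [math.exp(-decay * x) * math.sin(x) for x in xs]


def render(st):
    """Draw the page. `st` is the streamlit module, passed in as an argument
    so the function can also be exercised without Streamlit installed."""
    st.title("Damped wave explorer")
    # Each widget is one line and returns its current value.
    n_points = st.slider("Number of points", 10, 500, 100)
    decay = st.slider("Decay rate", 0.0, 1.0, 0.1)
    st.line_chart(damped_wave(n_points, decay))


# Streamlit reruns this whole file, top to bottom, whenever a widget changes.
# The guard only skips the UI when Streamlit is not available.
if importlib.util.find_spec("streamlit") is not None:
    import streamlit as st
    render(st)
```

Moving either slider triggers a full rerun of the script with the new values, which is the "every pass" behavior described above.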

Courses, Podcasts, and Other Resources

TL;DR: Start with sites like Kaggle that incentivize project building. Fill in knowledge gaps with resources like courses, but don't fall into the trap of doing courses endlessly with no applied experience.

I have spent the last 3 years or so learning Python and then expanding into the world of machine learning and now deep learning. I can't say that it's been the most efficient process, and I think an important take-away is to prioritize efficiency when choosing your learning path. Not everyone has the same amount of time to dedicate. While a bootcamp might get you where you want to be in 6 months, it can be a large time (and monetary) commitment. Career Karma is an interesting place to begin if you are interested in bootcamps but don't know where to start. For the rest of us, Dan Becker (creator of Kaggle Learn) explains in this episode of "Chai Time Data Science" that the first priority for anyone looking to move into a data science career should be building projects. Keeping that in mind, below you will find some of my favorite resources, split into those that are more conducive to building things quickly and others that are more helpful when you need to take a step back and just absorb some of the technical details.

Quick hitters:

  • SoloLearn - Learn syntax for Python (don't spend too much time here!)
  • Agile Geoscience Kata (for the geologically-inclined) - These small coding challenges are a great way to learn some Python basics, especially if you are a geoscientist who works with subsurface data. Likely not as useful for others.
  • Kaggle Learn - These courses have been specifically designed to get users into real use case examples of how machine learning is used to solve problems. The lessons are short, practical, and don't dwell on the details, but are enough to get you building. Coursera (see below) is a great alternative for more detailed learning.
  • FastAI - FastAI is organized in a similar way to Kaggle Learn. Jeremy Howard gets you training deep neural networks within the first video lecture and provides insights along the way. Again, Coursera is an alternative reference if you eventually want to learn to build your own networks from scratch.
  • TWIML Podcast - The TWIML AI podcast offers a large catalog of interviews with data scientists applying ML/DL. These may be difficult to follow along until you have a bit of experience, but the TWIML AI community also offers study groups and other resources that you may find useful.
  • Kaggle Competitions - As your skill set grows you can start applying what you are learning and discussing your results with the helpful community on Kaggle. My personal favorite is the yearly March Madness competition, which was unfortunately canceled due to the Covid-19 pandemic. Will try again next year!

More Detailed/Traditional Learning

  • Machine Learning Guide (Podcast) - Very high-level overviews of data science and types of languages/algorithms. Similar content can be found on the web, but this may be a useful resource if you are a beginner looking for something to listen to on your commute or when you can't be on a computer.
  • Coursera - If you need more detail than the applied lessons on Kaggle Learn, there is a lot of content on Coursera. A few examples of commonly recommended courses:
    • Python for Everybody - The first few courses of this specialization will give you a crash course in Python. It will provide more detail and active learning than something like SoloLearn, which is designed to be short and sweet.
    • Machine Learning - Great content as an introduction to the fundamentals of machine learning, but unfortunately taught in Octave or MATLAB. There is significant overlap between this course and the Deep Learning course below - you may actually try that one first and see if you need to backtrack.
    • Deep Learning - This course will introduce fundamental concepts related to building and training deep neural networks. Expect more theory and free-form coding than the FastAI course. As you get into coding in TensorFlow, this intro to TensorFlow fundamentals is also a very useful document.

Hopefully, these resources can be of help or at least get you moving in the right direction as you start your personal data science journey!