Beginner
As an aspiring data scientist or machine learning researcher/practitioner, you will need to develop proficiency in the following core data science tools.
- Programming in Python (or R), Bash, and SQL.
- Version control using Git and either GitHub or GitLab.
- Container-based workflows using Docker.
We cover all of the above (and more!) as part of the Introduction to Data Science Workshop Series offered by the KAUST Visualization Lab (KVL) in both the Fall and Spring semesters.
- Introduction to Python for Data Science
- Introduction to R for Data Science
- Introduction to SQL for Data Science
- Introduction to Shell for (Data) Scientists
- Introduction to Version Control using Git for (Data) Scientists
- Introduction to Conda for (Data) Scientists
The lesson materials for the Introduction to Data Science Workshop Series draw heavily from the Software Carpentry lessons on Python, R, SQL, Git, and Bash as well as a number of domain-specific data-science lessons developed by Data Carpentry that are built on this core tool stack.
Intermediate
Once you have a basic understanding of the core data science tool stack, the next step is to start gaining experience with the main Python machine learning libraries, such as Pandas, Scikit-learn, TensorFlow (Keras), PyTorch, and PySpark.
While an aspiring data scientist needs a grasp of all the core data science tools, you do not need to master all of these machine learning libraries. Instead, focus your time on the libraries that are most relevant to the problems you are working on in your research.
Books
There are a number of excellent books, all of which should be available from the University Library and have source code available for download on GitHub.
- Python for Data Analysis, 2nd Edition (source code) [Library link]
- Python Data Science Handbook (source code) [Library link]
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition (source code) [Library link]
- Deep Learning with Python (source code) [Library link]
Courses
The following Python-based online data science courses let you dig deeper into a particular topic.
- Practical Deep Learning for Coders, Part I (Fast.ai)
- Practical Deep Learning for Coders, Part II (Fast.ai)
- Introduction to Machine Learning for Coders (Fast.ai)
- Applied Data Science with Python (Coursera, University of Michigan)
- Deep Learning (Coursera, DeepLearning.ai)
- TensorFlow in Practice (Coursera, DeepLearning.ai)
- Genomic Data Science (Coursera, Johns Hopkins University)
- AI Programming with Python (Udacity Nanodegree)
- Become a Data Scientist (Udacity Nanodegree)
- Become a Machine Learning Engineer (Udacity Nanodegree)
Advanced
Once you have experience with Pandas, Scikit-learn, and one of PyTorch or TensorFlow (Keras), you can start exploring some of the cutting-edge machine learning libraries such as NVIDIA RAPIDS, Dask, XGBoost, and Numba.
Again, the point is not to master all of these libraries but rather to learn the ones that help you solve your research problems most efficiently.
Courses
The following courses are deep dives into advanced topics.