I was wondering about machine learning/data science project structure and workflow, and was reading different opinions on the subject. When people start to talk about workflow, they want their workflows to be reproducible. A lot of posts out there suggest using make to keep the workflow reproducible. Although make is very stable and widely used, I personally prefer cross-platform solutions. It is 2019 after all, not 1977. One can argue that make itself is cross-platform, but in reality you will run into trouble and spend time fixing your tool rather than doing the actual work. So I decided to look around and check what other tools are available. Yes, I decided to spend some time on tools.
My approach is based on Mateusz's posts about make. I encourage you to go and check his posts first; I will mostly use his code, but with different build systems. For more opinions on make itself, here are references to a couple of posts. Brooke Kennedy gives a high-level overview in 5 Easy Steps to Make Your Data Science Project Reproducible. Zachary Jones gives more details about the syntax and capabilities, along with links to other posts. David Stevens writes a very enthusiastic post on why you absolutely have to start using make right away, with nice examples comparing the old way and the new way. Samuel Lampa, on the other hand, writes about why using make is a bad idea.

One question I will ask of every tool is how it handles a chain of dependencies A -> B -> C: will target C be rebuilt if B changes? And if A changes?

The first tool is CMake. It still relies on make under the hood, but instead of writing the makefile directly you write a CMake file, which generates the makefile for you:

```cmake
cmake_minimum_required(VERSION 3.14.0 FATAL_ERROR)
project(Cmake_in_ml VERSION 0.1.0 LANGUAGES NONE)

set(IRIS_URL "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" CACHE STRING "URL to the IRIS data")
set(IRIS_DIR ${CMAKE_CURRENT_SOURCE_DIR}/data/raw)
set(IRIS_FILE ${IRIS_DIR}/iris.csv)

add_custom_command(OUTPUT ${IRIS_FILE}
    COMMAND ${CMAKE_COMMAND} -E echo "Downloading IRIS."
    COMMAND python src/data/download.py ${IRIS_URL} ${IRIS_FILE}
    COMMAND ${CMAKE_COMMAND} -E echo "Done. Checkout ${IRIS_FILE}."
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
)

add_custom_target(rawdata ALL DEPENDS ${IRIS_FILE})
```
The first interesting command defines the variable IRIS_URL, which is exposed to the user during the configuration step; if you use the CMake GUI, you can set this variable through the GUI. Next, the custom command downloads the data and declares IRIS_FILE as its output. Finally, we define a custom target rawdata that depends on IRIS_FILE, meaning that in order to build rawdata, IRIS_FILE must be built first. The ALL option makes rawdata one of the default targets. Note that I use CMAKE_CURRENT_SOURCE_DIR to keep the downloaded data in the source folder rather than in the build folder; this is just to keep the layout the same as Mateusz's.

CMake can generate build files for different build systems (run cmake --help to see the list of available generators). Fire up a terminal, go to the parent folder of the source code, and run:

```shell
mkdir overcome-the-chaos-build
cd overcome-the-chaos-build
cmake -G "MinGW Makefiles" ../overcome-the-chaos
```
We can now build all default targets with a single command:

```shell
cmake --build .
```

To see the list of available targets, run cmake --build . --target help; to remove the build artifacts, run cmake --build . --target clean.
The next step is data preprocessing. It produces two files: processed.pickle and processed.xlsx. You can see how Mateusz gets away with cleaning this Excel file by using rm with a wildcard. I don't think this is a very good approach. In CMake we have two options for dealing with it. The first option is the ADDITIONAL_MAKE_CLEAN_FILES directory property:

```cmake
set(PROCESSED_FILE ${CMAKE_CURRENT_SOURCE_DIR}/data/processed/processed.pickle)

add_custom_command(OUTPUT ${PROCESSED_FILE}
    COMMAND python src/data/preprocess.py ${IRIS_FILE} ${PROCESSED_FILE} --excel data/processed/processed.xlsx
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    DEPENDS rawdata ${IRIS_FILE}
)

add_custom_target(preprocess DEPENDS ${PROCESSED_FILE})

# Additional files to clean
set_property(DIRECTORY PROPERTY ADDITIONAL_MAKE_CLEAN_FILES
    ${CMAKE_CURRENT_SOURCE_DIR}/data/processed/processed.xlsx
)
```
The second option is to declare both files as outputs of the custom command:

```cmake
list(APPEND PROCESSED_FILE "${CMAKE_CURRENT_SOURCE_DIR}/data/processed/processed.pickle"
                           "${CMAKE_CURRENT_SOURCE_DIR}/data/processed/processed.xlsx"
)

add_custom_command(OUTPUT ${PROCESSED_FILE}
    COMMAND python src/data/preprocess.py ${IRIS_FILE} data/processed/processed.pickle --excel data/processed/processed.xlsx
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    DEPENDS rawdata ${IRIS_FILE} src/data/preprocess.py
)

add_custom_target(preprocess DEPENDS ${PROCESSED_FILE})
```
Note the DEPENDS option of this custom command: we declare a dependency not only on the custom target rawdata, but also on its output file and on the Python script itself. If we did not depend on IRIS_FILE, then modifying iris.csv manually would not trigger a rebuild of the preprocess target (granted, you should not modify files in your build directory manually in the first place; just letting you know). More details are in Sam Thursfield's post. The dependency on the Python script is needed so that the target is rebuilt when the script changes.

The last target draws the exploratory image:

```cmake
set(EXPLORATORY_IMG ${CMAKE_CURRENT_SOURCE_DIR}/reports/figures/exploratory.png)

add_custom_command(OUTPUT ${EXPLORATORY_IMG}
    COMMAND python src/visualization/exploratory.py ${PROCESSED_FILE} ${EXPLORATORY_IMG}
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    DEPENDS ${PROCESSED_FILE} src/visualization/exploratory.py
)

add_custom_target(exploratory DEPENDS ${EXPLORATORY_IMG})
```
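All these CMake targets shell out to Mateusz's Python scripts, which I reuse as-is. For orientation, here is a minimal sketch of what src/data/download.py needs to do; this is my assumption about its behavior, not his actual code:

```python
# Hypothetical sketch of src/data/download.py (assumed, not the original script).
import sys
import urllib.request


def pydownload_file(url, filename):
    """Fetch `url` and store the response body at `filename`."""
    urllib.request.urlretrieve(url, filename)


if __name__ == '__main__' and len(sys.argv) == 3:
    # Invoked by the build as: python src/data/download.py <url> <output-file>
    pydownload_file(sys.argv[1], sys.argv[2])
```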
The next tool is Pynt. Build instructions are written in Python, in a file called build.py. Targets/tasks are created with function decorators, and task dependencies are provided through the same decorator. Since tasks are plain Python, I wanted to import the project's helper functions directly:

```python
from src.data.download import pydownload_file
```

To make this import work from build.py, I had to extend the module search path first:

```python
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), '.'))

from src.data.download import pydownload_file
```

My first version of the build.py file looked like this:

```python
#!/usr/bin/python
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), '.'))

from pynt import task
from path import Path
import glob

from src.data.download import pydownload_file
from src.data.preprocess import pypreprocess

iris_file = 'data/raw/iris.csv'
processed_file = 'data/processed/processed.pickle'


@task()
def rawdata():
    '''Download IRIS dataset'''
    pydownload_file('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', iris_file)


@task()
def clean():
    '''Clean all build artifacts'''
    patterns = ['data/raw/*.csv', 'data/processed/*.pickle',
                'data/processed/*.xlsx', 'reports/figures/*.png']
    for pat in patterns:
        for fl in glob.glob(pat):
            Path(fl).remove()


@task(rawdata)
def preprocess():
    '''Preprocess IRIS dataset'''
    pypreprocess(iris_file, processed_file, 'data/processed/processed.xlsx')
```
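The decorator mechanism itself is easy to picture in plain Python. Here is a toy sketch of decorator-based task registration with dependencies (my illustration, not Pynt's actual implementation):

```python
# Toy sketch of decorator-based task registration, in the spirit of Pynt.
# NOT Pynt's actual implementation.
_registry = {}


def task(*deps):
    def wrap(fn):
        # Remember the function and the names of the tasks it depends on.
        _registry[fn.__name__] = (fn, [d.__name__ for d in deps])
        return fn
    return wrap


def run(name, done=None):
    done = set() if done is None else done
    if name in done:
        return
    fn, deps = _registry[name]
    for dep in deps:   # run dependencies first, each at most once per invocation
        run(dep, done)
    fn()
    done.add(name)


order = []


@task()
def rawdata():
    order.append('rawdata')


@task(rawdata)
def preprocess():
    order.append('preprocess')


run('preprocess')   # → order == ['rawdata', 'preprocess']
```

Note that a runner like this re-executes every task on each invocation; it has no notion of whether anything changed, which is consistent with the full rebuilds Pynt does.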
But the preprocess target didn't work: it constantly complained about the input arguments of the pypreprocess function. It seems that Pynt does not handle optional function arguments very well, and I had to remove the argument that produces the Excel file. Keep this in mind if your project has functions with optional arguments.

The available tasks can be listed with pynt -l:

```
Tasks in build file build.py:
  clean          Clean all build artifacts
  exploratory    Make an image with pairwise distribution
  preprocess     Preprocess IRIS dataset
  rawdata        Download IRIS dataset
Powered by pynt 0.8.2 - A Lightweight Python Build Tool.
```
Building the exploratory target with pynt exploratory produces:

```
[ build.py - Starting task "rawdata" ]
Downloading from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data to data/raw/iris.csv
[ build.py - Completed task "rawdata" ]
[ build.py - Starting task "preprocess" ]
Preprocessing data
[ build.py - Completed task "preprocess" ]
[ build.py - Starting task "exploratory" ]
Plotting pairwise distribution...
[ build.py - Completed task "exploratory" ]
```

But if we run the same command (pynt exploratory) again, there will be a full rebuild. Pynt didn't track that nothing has changed.
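What Pynt is missing here is make's timestamp rule. As a sketch (a hypothetical helper, not part of Pynt or make), the decision make-like tools apply looks like this:

```python
import os


def needs_rebuild(target, dependencies):
    """Make-style check: rebuild if the target is missing
    or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

Applied recursively along a chain A -> B -> C, this rule is what makes a change in A propagate to a rebuild of C.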
Paver looks almost identical to Pynt: tasks are plain Python functions marked with a decorator, and dependencies are declared with a separate decorator (@needs). Paver makes a full rebuild each time and doesn't play nicely with functions that have optional arguments. Build instructions are found in the pavement.py file.

Next comes doit. Imagine that target A produces the file outA. We have to mention outA twice in target A: once as the target's artifact, and once more as a return value of the target's action. Then we have to specify it as an input to target B. So there are three places in total where we provide information about the file outA. And even after we do so, modifying outA will not lead to an automatic rebuild of target B: if we ask doit to build target B, it only checks whether target B itself is up to date, without checking any of its dependencies. To overcome this, we have to specify outA a fourth time, as a file dependency of target B. I see this as a drawback; both Make and CMake handle such situations correctly.

Another surprise is that the paths ./myfile.txt and myfile.txt are viewed as different. And, as I wrote above, I find the way of passing information from target to target (when using Python actions) a bit strange: a target declares a list of artifacts it is going to produce, but another target can't use that list. Instead, the Python function that constitutes the target must return a dictionary, which can then be accessed by another target. Let's see an example:

```python
def task_preprocess():
    """Preprocess IRIS dataset"""
    pickle_file = 'data/processed/processed.pickle'
    excel_file = 'data/processed/processed.xlsx'
    return {
        'file_dep': ['src/data/preprocess.py'],
        'targets': [pickle_file, excel_file],
        'actions': [doit_pypreprocess],
        'getargs': {'input_file': ('rawdata', 'filename')},
        'clean': True,
    }
```
Here the target preprocess depends on rawdata. The dependency is provided via the getargs property: the argument input_file of the function doit_pypreprocess is the output filename of the target rawdata. Have a look at the complete example in file dodo.py.
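To make getargs concrete, here is a hypothetical sketch of a matching rawdata task; doit_pydownload_file is an assumed wrapper (the download itself is omitted), and returning a dict from the action is what exposes filename to task_preprocess:

```python
# Hypothetical sketch of the rawdata task in dodo.py (assumed, not the original).
iris_file = 'data/raw/iris.csv'


def doit_pydownload_file(url, filename):
    # The actual download is omitted in this sketch.
    # Returning a dict exposes values to dependent tasks:
    # task_preprocess reads `filename` via its `getargs` entry.
    return {'filename': filename}


def task_rawdata():
    """Download IRIS dataset"""
    return {
        'targets': [iris_file],
        # doit accepts (callable, args) tuples as actions.
        'actions': [(doit_pydownload_file,
                     ['https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                      iris_file])],
    }
```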
The last tool is Luigi, and it stands somewhat apart. Luigi cannot be pointed at an arbitrary build file (like dodo.py, pavement.py, or a makefile); rather, one has to pass a Python module name. So if we try to use it in a similar way to the other tools (placing a file with tasks in the project's root), it won't work: we have to either install our project or modify the PYTHONPATH environment variable by adding the path to the project.

A task's output method tells Luigi where the results of the task will end up; the results can be a single element or a list. The requires method specifies task dependencies (other tasks, although it is possible to make a task depend on itself). And that's it: whatever is specified as output in task A is passed as an input to task B if task B relies on task A.

A task is run by naming the module and the task:

```shell
luigi --local-scheduler --module luigitasks Exploratory
```
Here is a summary of my findings:

| | Define target with dependency | Incremental builds | Incremental builds if source code is changed | Ability to figure out which artifacts to remove during clean command |
|---|---|---|---|---|
| CMake | yes | yes | yes | yes |
| Pynt | yes | no | no | no |
| Paver | yes | no | no | no |
| doit | somewhat yes | yes | yes | yes |
| Luigi | yes | no | no | no |