After attending many conferences in academia, PyConDE & PyData Berlin 2019 was my first conference as a full-time data scientist. Coming from the psychological sciences, I was a heavy R user, and my passion for Python has grown only recently. I was more than excited to see and hear what the community is currently up to, which trends are on the horizon, and what new ideas people are pursuing.
Deployment of ML Models
As data science grows more mature, people care about building tools that are robust rather than merely innovative. Thus, deployment of ML models was also a main topic at the conference. Besides some intros to Apache Airflow and a “deployment-oriented mindset”, one of the most interesting new frameworks was Kedro. Kedro was only recently open-sourced by QuantumBlack, a McKinsey spin-off, and had supposedly been used internally for some years. While the tool itself does nothing brand new - it combines functionalities such as cookiecutter for project templating, pipeline operators for operation chaining, and config files for defining datasets, logging, and credentials - it unifies all of them within a single framework. It also promises compatibility with a wide range of frameworks and data types, from cloud environments to Excel sheets. Definitely something to try out.
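To make the pipeline idea concrete, here is a minimal, dependency-free sketch of the pattern Kedro unifies: each node is a plain function with named inputs and outputs, and a runner resolves them from a data catalog. All function and variable names here are illustrative, not Kedro's actual API.

```python
def clean(raw):
    """Drop empty/falsy records."""
    return [r for r in raw if r]

def summarize(cleaned):
    """Compute a simple summary of the cleaned data."""
    return {"n": len(cleaned), "total": sum(cleaned)}

# A pipeline as ordered (function, input_name, output_name) triples;
# in Kedro these mappings would live in config files instead of code.
pipeline = [
    (clean, "raw_data", "clean_data"),
    (summarize, "clean_data", "summary"),
]

def run(pipeline, catalog):
    """Execute nodes in order, storing each output back in the catalog."""
    for func, inp, out in pipeline:
        catalog[out] = func(catalog[inp])
    return catalog

catalog = {"raw_data": [3, 0, 5, None, 4]}
result = run(pipeline, catalog)
print(result["summary"])  # {'n': 3, 'total': 12}
```

The appeal of this pattern is that datasets are referenced by name, so swapping an Excel sheet for a cloud bucket only changes the catalog, not the pipeline code.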
Automated Feature Engineering
If data scientists can automate the workflow of whole marketing departments, why not also automate the workflow of a data scientist? Whenever I come across this frightening but intriguing topic, I am somewhat relieved that it is not really feasible yet. At PyConDE & PyData Berlin 2019, two talks addressed it. Franziska Horn presented an interesting approach to automated feature engineering for linear regression, called autofeat: it iteratively applies a range of typical feature transformations to the feature matrix and uses lasso regression to identify relevant features. The bigger goal of the approach is to transform a complex nonlinear prediction task into a linear space and thus retain a simple regression model with the benefit of interpretability. While an informed audience of statisticians may appreciate this, it could be hard to explain a three-way interaction of log-transformed features to your stakeholders. The next talk compared three frameworks for feature engineering automation: tpot (a scikit-learn-oriented full ML automation framework), featuretools (specialized in feature engineering for relational databases), and tsfresh (designed for time series feature engineering). The take-home result was that all three automation workflows needed a considerable amount of configuration time, and the most promising features they yielded were also the most obvious ones that could have been derived conceptually. They may, however, be helpful when exploring new problems and support the data scientist in the initial R&D phase.
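The expand-then-select idea behind autofeat can be sketched in a few lines. This toy version expands the feature matrix with typical nonlinear transformations and then keeps the features most useful for a linear model; autofeat uses lasso regression for the selection step, while this sketch substitutes a simple correlation screen to stay dependency-light. All names are illustrative, not autofeat's API.

```python
import numpy as np

def expand_features(X):
    """Apply typical transformations column-wise (positive inputs assumed)."""
    transforms = {"id": X, "log": np.log(X), "sqrt": np.sqrt(X), "sq": X ** 2}
    names = [f"{t}(x{j})" for t in transforms for j in range(X.shape[1])]
    return np.hstack(list(transforms.values())), names

def select_features(X_exp, y, names, k=2):
    """Keep the k features most correlated with the target
    (a crude stand-in for autofeat's lasso selection)."""
    corrs = [abs(np.corrcoef(X_exp[:, j], y)[0, 1]) for j in range(X_exp.shape[1])]
    top = np.argsort(corrs)[::-1][:k]
    return [names[j] for j in top]

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 2))
y = 3 * np.log(X[:, 0]) + 0.5 * X[:, 1] ** 2  # nonlinear ground truth

X_exp, names = expand_features(X)
print(select_features(X_exp, y, names))
```

On this synthetic target, the squared term of the second feature should surface among the selected features, illustrating how a nonlinear relationship becomes linear in the expanded space.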
Deep Data Scientist
Are you a data scientist? If yes, you can now easily call yourself a deep data scientist. Not sure if that is equivalent to a 10x engineer, but the title was definitely good for a pun. The reason for this title upgrade is a framework called skorch (sklearn + pytorch = skorch), which allows you to train PyTorch models in a scikit-learn way and thus simplifies the workflow of training neural nets, e.g. by making use of transformation pipelines and model stacking. Sure, there are many cases where you don’t want to use this (e.g. a highly customized architecture), but it removes a lot of boilerplate code and provides an easy switch between “traditional” ML models and neural nets without having to change much of your project architecture.
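The core of what skorch provides is the scikit-learn estimator contract (fit/predict) wrapped around a trainable model. To keep this sketch runnable without PyTorch, the “net” below is a trivial threshold model rather than a real neural network; the class and method names mirror the sklearn convention, not skorch's actual internals.

```python
class SklearnStyleNet:
    """Wrap an arbitrary trainable model behind fit/predict,
    so it slots into sklearn pipelines and model stacking."""

    def __init__(self, threshold=None):
        self.threshold = threshold

    def fit(self, X, y):
        # "Training": pick the midpoint threshold separating the classes.
        # A real skorch net would run the PyTorch training loop here.
        pos = [x for x, label in zip(X, y) if label == 1]
        neg = [x for x, label in zip(X, y) if label == 0]
        self.threshold = (min(pos) + max(neg)) / 2
        return self  # sklearn convention: fit returns self

    def predict(self, X):
        return [1 if x >= self.threshold else 0 for x in X]

model = SklearnStyleNet().fit([1, 2, 8, 9], [0, 0, 1, 1])
print(model.predict([1.5, 8.5]))  # [0, 1]
```

Because the wrapper speaks the same interface as any sklearn estimator, swapping it for a logistic regression (or back to a neural net) changes one line, which is exactly the “easy switch” skorch is after.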
Above all, I was impressed to see so many people contributing to open source software and helping to maintain all the wonderful tools we love and use every day. People like Python core developer Mariatta, who shared her experience of how hard it is to make small infrastructural changes - such as moving the project from Mercurial to GitHub - in an open source community project. There are many more who do so much for the Python open source community (e.g. Peter Wang from Anaconda, scikit-learn contributor Adrin Jalali, or the NumFOCUS team), each sharing their perspectives on current developments in Python ML and data science.
There is a lot more that I did not mention, but these were some of my personal highlights (check out the full program here if you are interested), and I am quite certain that I will come back in 2020.