We share the technology and engineering culture of Zeals Inc., developer of the chat commerce service "Zeals", which aims to bring a hospitality revolution to the internet through chatbots. This blog is no longer updated. New blog: https://medium.com/zeals-tech-blog

Introducing Kubeflow to Zeals


Self-introduction

Hi there, this is Allen from Zeals Japan. I work as an SRE / gopher, mainly responsible for microservices development.




Nowadays machine learning is everywhere, and we believe it will keep trending over the next few years. Data scientists work on large datasets on a daily basis to develop models that help the business in different areas.

We are no exception: our machine learning team works with different datasets across multiple areas, including deep learning (DL), natural language processing (NLP), and behaviour prediction, to improve our product. But since we handle massive amounts of data, the team soon realized that working locally or on a cloud-provider notebook like Colab or Kaggle was dragging down their productivity significantly.

What's wrong?

  • Unable to scale up and secure more resources when handling heavier workloads
  • Limited access to GPUs
  • On cloud notebooks, results are not persisted automatically; they reset once you idle or exit
  • Hard to share notebooks with your co-workers
  • No way to use a custom image, so you need to set up the environment every time
  • Hard to share common datasets within the team



Current Implementation

Originally we used a helm chart to install JupyterHub on our Kubernetes cluster, and we had a hard time managing resources and sharing datasets via a shared volume.

As part of the infrastructure team, we needed to adjust resources for the ML team frequently, which is not ideal and obviously drags down both teams' productivity.


There are multiple solutions available in the community, including kubespawner, the zero-to-jupyterhub-k8s helm chart we originally used, and Kubeflow.

  • kubespawner
    • Stars: 328 (2020-08-12)
    • Pros
      • Able to spawn multiple notebook deployments separated by namespace
      • Extremely customizable configuration via its Python API
      • Ability to mount different volumes to different notebook deployments
    • Cons
      • Community is small
      • Lacking support for cloud-native features: setting up networking, the Kubernetes cluster, permissions, etc. still needs to be handled manually
      • Lacking authorization support
  • zero-to-jupyterhub-k8s
    • Stars: 740 (2020-08-12)
    • Pros
      • Official support: a helm chart published by JupyterHub
      • Easy to set up and manage with helm
      • Good authorization support, such as GitHub and Google OAuth
    • Cons
      • Limited support for individual namespaces
      • Hard to declare and mount volumes based on notebook usage
      • Lacking support for cloud-native features: setting up networking, the Kubernetes cluster, permissions, etc. still needs to be handled manually
  • Kubeflow
    • Stars: 9.2k (2020-08-12)
    • Pros
      • Good support from both the authors and the community
      • Good support for different cloud platforms
      • Not limited to notebooks; it also ships other tools that help with the machine learning process, such as pipelines and hyperparameter tuning
      • Able to easily separate namespaces between different users without changing any code
      • Can easily mount multiple volumes based on notebook usage
      • Dynamic GPU support
    • Cons
      • Very large stack that is hard to understand and customize
      • Needs its own dedicated cluster, so the running cost is higher
      • Steep learning curve: compared to a plain notebook, Kubeflow also requires Kubernetes knowledge when you use tools like pipelines and hyperparameter tuning

What We Chose

Kubeflow is our pick. From the comparison above, it is easy to see that Kubeflow has many features we think we will need in the future. The entire solution also comes as a package, so it looks like it can be set up quite easily.

I quickly felt this might be the one I was looking for and couldn't wait to try it.

They released the first stable version, 1.0, back in March, so it seemed like a good time for us to try it.


Try it out first!

At this stage, I hadn't decided to proceed with Kubeflow, but as infrastructure people, we always need to test a tool before we introduce it to others.

The installation of Kubeflow is quite simple if you are running on the cloud. They provide out-of-the-box installation scripts for each cloud provider, and you just need to run them.


Setting up the project

Since we are running on GCP, I'll use that as an example, but you can also find instructions for the cloud provider you are using, or even for an on-premise cluster, here.

It's good practice to create a new GCP project when you try something out, so it stays isolated from your other environments.

gcloud projects create kubeflow

Installing CLI

Follow the steps here to set up OAuth so the Kubeflow CLI can access GCP resources.

First, we need to install the Kubeflow CLI. You can find the latest binary on the GitHub releases page:

tar -xvf kfctl_v1.0.2_<platform>.tar.gz && mv kfctl /usr/local/bin/

Setup the GCP bases

After that, it's just some standard gcloud configuration:

# Set your GCP project ID and the zone where you want to create
# the Kubeflow deployment:
export PROJECT=kubeflow
export ZONE=asia-east1-a
export CLIENT_ID=<CLIENT_ID from OAuth page>
export CLIENT_SECRET=<CLIENT_SECRET from OAuth page>

gcloud config set project ${PROJECT}
gcloud config set compute/zone ${ZONE}

Note that multi-zone clusters are not yet supported if you want to use GPUs. We're using a zone in asia-east1 here since it was the only region near us with K80 GPU support at the time of writing (July 2020).

Spinning up the cluster

To spin up the cluster, simply run:

export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.2.yaml"
export KF_NAME=kubeflow
export BASE_DIR=$(pwd)
export KF_DIR=${BASE_DIR}/${KF_NAME}

mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}

Customizing the deployment

For simplicity we are directly applying here, but if you want to customize the manifest it's also possible by running

export CONFIG_FILE="kfdef.yaml"
curl -L -o ${CONFIG_FILE} https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.2.yaml
kfctl build -V -f ${CONFIG_FILE}
# modify your manifest
kfctl apply -V -f ${CONFIG_FILE}

Verify the installation

Get the kube context

gcloud container clusters get-credentials ${KF_NAME} --zone ${ZONE} --project ${PROJECT}
kubectl -n kubeflow get all

Accessing the UI

Kubeflow automatically generates an endpoint in the following format; it can take a few minutes before it becomes accessible.

kubectl -n istio-system get ingress
# NAME            HOSTS                                                      ADDRESS   PORTS   AGE
# envoy-ingress   your-kubeflow-name.endpoints.your-gcp-project.cloud.goog             80      5d13h
# https://<KF_NAME>.endpoints.<project-id>.cloud.goog/

That's all, pretty easy! Now you can access the link and check the UI.
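For reference, the endpoint hostname simply combines the Kubeflow deployment name and the GCP project ID; a minimal sketch of that pattern (the names below are placeholders, not our real deployment):

```python
def kubeflow_endpoint(kf_name: str, project_id: str) -> str:
    """Build the Cloud Endpoints URL that the GCP install registers."""
    return f"https://{kf_name}.endpoints.{project_id}.cloud.goog/"

# Example with placeholder names:
print(kubeflow_endpoint("kubeflow", "my-gcp-project"))
# https://kubeflow.endpoints.my-gcp-project.cloud.goog/
```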

Setting up notebook server

Creating the notebook server

Navigate to Notebook servers -> New server



You can see there are tons of options we can configure!


Settings for the notebook server

Breaking it down a bit:

  • Image
    • Able to use a prebuilt TensorFlow notebook image or a custom notebook server image
    • You can prebuild common dependencies into an image, and everyone then has access to the same setup!
  • CPU / RAM
    • Data scientists can now manage resources by themselves without notifying the infrastructure team!
  • Workspace volume
    • Each notebook creates a new workspace volume by default. This ensures you won't lose your progress if you step away or the pod accidentally shuts down
    • You can even share a workspace volume with your team by configuring it as ReadWriteMany
  • Data volumes
    • Now it's super easy to share datasets using data volumes: scientists just choose which dataset they want to use
    • You can even mount multiple datasets in the same notebook server
  • Configurations
    • Used to store credentials or secrets. On GCP, the list defaults to Google credentials, so you can use the gcloud command or access Cloud SQL / BigQuery for even more data
  • GPUs
    • Now it's dynamic! But don't forget to turn it off once you've finished, otherwise it may blow up your bill!

Running our first experiment


I created a notebook server with all default parameters, running the tensorflow-2.1.0-notebook-cpu:1.0.0 image.

I want to build a simple salary prediction model following the fastai tutorial.

Since the image doesn't come with fastai, simply install it

! pip install --user --upgrade pip
! pip install fastai

Training the model

We simply copy the code from the tutorial

from fastai.tabular import *

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.to_csv('./adult.csv') # saving it for later usage

procs = [FillMissing, Categorify, Normalize]
valid_idx = range(len(df)-2000, len(df))
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

learn = tabular_learner(data, layers=[200,100], emb_szs={'native-country': 10}, metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)

We successfully trained a model!

| epoch | train_loss | valid_loss | accuracy | time |
| --- | --- | --- | --- | --- |
| 0 | 0.319807 | 0.321294 | 0.847000 | 00:06 |

We can simply save the current checkpoint to the workspace and resume training next time!

torch.save(learn.model.state_dict(), 'prediction.pth')

Hyperparameters Tuning


Kubeflow is not limited to notebook servers; it also has many other modules that are very convenient for data scientists. Katib is one of the modules you can use.

Katib provides both hyperparameter tuning and neural architecture search; we will try out hyperparameter tuning here.

Writing the job

Using Katib is extremely easy, and if you are familiar with Kubernetes manifests it will be even easier for you. Katib runs your training as Kubernetes Jobs, repeatedly launching trials until it hits the target metric value or the maximum number of trials.

We will use the same salary prediction model, but this time we want to tune those input values.

Training script

import argparse

from fastai.tabular import *

class MetricCallback(Callback):
    def on_epoch_end(self, **kwargs: Any):
        epoch = kwargs.get('epoch')
        acc = kwargs.get('last_metrics')[1].detach().item()
        print(f'epoch: {epoch}')
        print(f'accuracy={acc}')

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description='Hyperparameter tuning for the salary prediction model.')
    parser.add_argument('--lr', type=float, help='Learning rate')
    parser.add_argument('--num_layers', type=int, help='Size of the second hidden layer')
    parser.add_argument('--emb_szs', type=int, help='Embedding size for native-country')

    args = parser.parse_args()

    df = pd.read_csv('./adult.csv')
    path = "./"

    procs = [FillMissing, Categorify, Normalize]
    valid_idx = range(len(df)-2000, len(df))
    dep_var = 'salary'
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

    data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

    # Katib collects metrics from stdout in key=value format, e.g.:
    # epoch: 1
    # loss=0.3
    # accuracy=0.85

    learn = tabular_learner(data, layers=[200,args.num_layers], emb_szs={'native-country': args.emb_szs}, metrics=accuracy)
    learn.fit_one_cycle(5, args.lr, callbacks=[MetricCallback()])

Collecting metrics

Katib automatically collects the training metrics from stdout, so we only need to print them out.

In the args we pass lr, num_layers, and emb_szs as hyperparameters.
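To make the stdout convention concrete, here is a small sketch (not Katib's actual collector, just an illustration) of how key=value metric lines can be emitted by the training script and parsed back out of its output:

```python
import re

def log_metrics(epoch: int, metrics: dict) -> str:
    """Format metrics the way our training script prints them to stdout."""
    lines = [f"epoch: {epoch}"]
    lines += [f"{name}={value}" for name, value in metrics.items()]
    return "\n".join(lines)

def parse_metrics(stdout: str) -> dict:
    """Pick up key=value lines, as a stdout-based metrics collector would."""
    return {m.group(1): float(m.group(2))
            for m in re.finditer(r"^(\w+)=([\d.]+)$", stdout, re.MULTILINE)}

out = log_metrics(1, {"accuracy": 0.847})
print(out)
print(parse_metrics(out))  # {'accuracy': 0.847}
```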

Job definition

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
  namespace: kubeflow-allen-ng
    controller-tools.k8s.io: "1.0"
  name: predict-salary-hyper
    type: maximize
    goal: 0.9
    objectiveMetricName: accuracy
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
    - name: --lr
      parameterType: double
        min: "0.01"
        max: "0.03"
    - name: --num_layers
      parameterType: int
        min: "50"
        max: "100"
    - name: --emb_szs
      parameterType: int
        min: "10"
        max: "50"
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
            name: {{.Trial}}
            namespace: {{.NameSpace}}
                - name: {{.Trial}}
                  image: gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
                  - mountPath: /home/jovyan
                    name: workspace-salary-prediction
                    readOnly: true
                  - "cd /home/jovyan && python3 hyper_tuning.py"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
                - name: workspace-salary-prediction
                    claimName: workspace-salary-prediction

We use objectiveMetricName: accuracy as the target metric, with the target value goal: 0.9.

lr: random from 0.01 to 0.03
num_layers: random from 50 to 100
emb_szs: random from 10 to 50

We also configured the maximum number of trials with maxTrialCount: 12.
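Conceptually, the random algorithm is doing something like the following sketch (a hypothetical stand-in for Katib's loop, not its real implementation; train_and_score is a placeholder for one trial run):

```python
import random

# Search space from the experiment manifest above
SPACE = {
    "--lr": ("double", 0.01, 0.03),
    "--num_layers": ("int", 50, 100),
    "--emb_szs": ("int", 10, 50),
}
GOAL, MAX_TRIALS = 0.9, 12  # goal / maxTrialCount

def sample_trial(rng: random.Random) -> dict:
    """Draw one random point from the search space."""
    return {name: (rng.randint(lo, hi) if ptype == "int" else rng.uniform(lo, hi))
            for name, (ptype, lo, hi) in SPACE.items()}

def run_search(train_and_score, seed: int = 0):
    """Run trials until the goal is reached or maxTrialCount is hit."""
    rng = random.Random(seed)
    best = None
    for _ in range(MAX_TRIALS):
        params = sample_trial(rng)
        score = train_and_score(params)
        if best is None or score > best[1]:
            best = (params, score)
        if score >= GOAL:  # objective goal reached, stop early
            break
    return best

# Dummy scoring function standing in for an actual training job
best_params, best_acc = run_search(lambda p: 0.8 + p["--lr"])
```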


The job will automatically start once you submit it.

The result updates as each job finishes, and you can see it under HP -> Monitor.


Previously, using only Jupyter notebooks, we didn't even conduct hyperparameter tuning: either you write a huge loop that runs for a decade, or you simply use your sixth sense to decide the hyperparameters.


Kubeflow is a very good out-of-the-box tool that lets you set up an analysis environment without any pain. It provides several powerful modules, such as:

  • Pipelines
  • Katib
  • Notebook Server

There are also more features we haven't touched on in this article, such as:

  • Namespaced permission management
  • Sharing notebook among the team or outsiders
  • Continuous training and deployment for machine learning model
  • Continuous ETL integration with cloud storage or data warehouse

All of these are very common requirements from data scientists, and they fit most companies as well.

We are still in the middle of the transition, so we didn't manage to cover all of Kubeflow's features here; we definitely want to write more after we explore it further.


We are hiring!

We are the industry leader in chatbot commerce in Japan, and our company is based in Tokyo. If you are a talented engineer and interested in our company, simply drop an application here and we can start with a casual talk.

Opening Roles

(Sorry, the job listings are still in Japanese; we are working on translating them into English.)