Doing a reproducible experiment

Todo

This tutorial uses old versions of Batsim and of the schedulers; it has not been updated for Batsim’s new version (v5.0). It should still (mostly) work with appropriate changes to the command lines used. Please do not hesitate to contact us if you have any difficulties with it.

This tutorial shows how to execute a Batsim simulation in a fully reproducible software environment thanks to the Nix package manager. Two schedulers are used here, to show how to use a specific version of batsched and pybatsim.

Prerequisites

As some knowledge of Nix is recommended, we strongly encourage readers to follow a tutorial about Nix and reproducible experiments first. That tutorial introduces the main Nix concepts and focuses on those useful for reproducible experiments, lowering the entry cost into the Nix ecosystem.

Warning

This tutorial uses old versions of Batsim and of the schedulers. It should still work (if it does not, this is a bug: please contact us to report the issue), and the steps presented to improve the experiment's repeatability are still relevant.

However, if you are looking for a starting environment to copy and paste for your own experiment, please refer to Using Batsim from a well-defined Nix environment rather than to these files.

Defining a reproducible environment

A first attempt

A direct attempt to define such an environment is to use the packages defined in NUR-Kapack. This is what the following Nix file describes.

env-first-attempt.nix

{ kapack ? import
    ( fetchTarball "https://github.com/oar-team/nur-kapack/archive/master.tar.gz")
  {}
}:

with kapack.pkgs;

let
  self = rec {
    experiment_env = mkShell rec {
      name = "experiment_env";
      buildInputs = [
        # simulator
        kapack.batsim
        # scheduler implementations
        kapack.batsched
        kapack.pybatsim
        # misc. tools to execute instances
        kapack.batexpe
      ];
    };
  };
in
  self.experiment_env

Assuming that the env-first-attempt.nix file is in your working directory, you can enter the environment with the following command: nix-shell --pure ./env-first-attempt.nix. You should be able to call batsim, batsched, pybatsim and robin from inside this shell. The shell can be exited as usual (Ctrl+D, exit…).
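
To confirm that the shell really exposes these four commands, a small check on PATH can help. The sketch below defines check_tools, a hypothetical helper (it is not part of Batsim or Nix) that reports any command name it cannot find.

```shell
# check_tools is a hypothetical helper, not part of Batsim or Nix.
# It prints one "missing: NAME" line per command that is not on PATH
# and returns a non-zero status if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

From inside the shell, check_tools batsim batsched pybatsim robin should print nothing and return 0.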

While this shell works perfectly well for using the latest release of the desired packages, it does not describe the desired versions precisely enough to be reproducible: if you use the same environment in the future, the versions may have changed.

Pinning the package repository

A first step to improve our setup is to use a specific pinned version of NUR-Kapack (the repository that contains package definitions) instead of its master branch.

env-pin-pkgs-repo.nix

{ kapack ? import
    ( fetchTarball "https://github.com/oar-team/nur-kapack/archive/1672831224a21d6c34350d8f78cff9266e3e28a2.tar.gz")
  {}
}:

with kapack.pkgs;

let
  self = rec {
    experiment_env = mkShell rec {
      name = "experiment_env";
      buildInputs = [
        # simulator
        kapack.batsim
        # scheduler implementations
        kapack.batsched
        kapack.pybatsim
        # misc. tools to execute instances
        kapack.batexpe
      ];
    };
  };
in
  self.experiment_env

The env-pin-pkgs-repo.nix file is exactly the same as the previous one except for its kapack import definition: Kapack’s commit 1672831 is used instead of the master branch. This ensures that the versions of the desired packages (batsim, batsched…) will not change.

Warning

This is not true if you use the -master variant of a package!

For example, batsim-master represents Batsim built from the latest commit of Batsim’s master branch, without any kind of commit pinning.
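
To pin the repository yourself, you need the full hash of the commit to use. One way to obtain the commit that master currently points to is git ls-remote https://github.com/oar-team/nur-kapack.git refs/heads/master. The sketch below then builds the archive URL to give to fetchTarball; kapack_tarball_url is a hypothetical helper name, and the URL layout is simply the one used in the files above.

```shell
# kapack_tarball_url is a hypothetical helper, not part of Nix or NUR-Kapack.
# It prints the GitHub archive URL of NUR-Kapack at the given commit, and
# refuses anything that is not a full 40-character lowercase hexadecimal hash
# (a branch name such as "master" would not pin anything).
kapack_tarball_url() {
    commit=$1
    case "$commit" in
        ""|*[!0-9a-f]*)
            echo "not a commit hash: $commit" >&2
            return 1
            ;;
    esac
    if [ "${#commit}" -ne 40 ]; then
        echo "expected a full 40-character hash: $commit" >&2
        return 1
    fi
    echo "https://github.com/oar-team/nur-kapack/archive/$commit.tar.gz"
}
```

Called with 1672831224a21d6c34350d8f78cff9266e3e28a2, it prints the URL used in env-pin-pkgs-repo.nix.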

Pinning the packages

In many cases you might want to use a specific version of a package rather than its latest release. Nix makes this easy by overriding a package definition, as in the following file.

env-pin-everything.nix

{ kapack ? import
    ( fetchTarball "https://github.com/oar-team/nur-kapack/archive/1672831224a21d6c34350d8f78cff9266e3e28a2.tar.gz")
  {}
}:

with kapack.pkgs;

let
  self = rec {
    my_batsim = kapack.batsim-310.overrideAttrs (attr: rec {
      name = "batsim-3.1.0-346e0de";
      src = fetchgit {
        url = "https://framagit.org/batsim/batsim.git";
        rev = "346e0de311c10270d9846d8ea418096afff32305";
        sha256 = "0jacrinzzx6nxm99789xjbip0cn3zfsg874zaazbmicbpllxzh62";
      };
    });
    my_batsched = kapack.batsched-130.overrideAttrs (attr: rec {
      name = "batsched-1.3.0-db0450a";
      src = fetchgit {
        url = "https://framagit.org/batsim/batsched.git";
        rev = "db0450a608656f0661f4c8d6f132c68b2d402a59";
        sha256 = "05bn43xrk4qz8w2v58zkl17vmj4y57y93lrgrpv7jsdkird9a5vw";
      };
    });
    my_pybatsim = kapack.pybatsim-320.overrideAttrs (attr: rec {
      name = "pybatsim-3.1.0-ca81c4a";
      src = fetchgit {
        url = "https://gitlab.inria.fr/batsim/pybatsim.git";
        rev = "ca81c4a49b84fb5249367ae64bdc9289d619a033";
        sha256 = "153wxqyz2pgb3skspz9628s91zrsvbzvgxpx6c6sbjharavdnyik";
      };
    });

    experiment_env = mkShell rec {
      name = "experiment_env";
      buildInputs = [
        # simulator
        my_batsim
        # scheduler implementations
        my_batsched
        my_pybatsim
        # misc. tools to execute instances
        kapack.batexpe
      ];
    };
  };
in
  self.experiment_env

The env-pin-everything.nix file defines custom versions of Batsim, batsched and pybatsim, and uses them in the experiment_env shell. This is especially useful for using your own variant of a scheduling algorithm: you can put the git repository of your choice in the url field of the fetchgit call and thus use your own fork of a scheduler project.

Note

Correctly filling the sha256 field when overriding a package source can be annoying.

A fast way to find the right value is to first fill the field with an arbitrary sha256 value (echo hello | sha256sum), then to try to enter your shell. Nix will whine about the hash mismatch, then print the hash value it computed :). In the example below, Nix computed a sha256 value of 0jacrinzzx6nxm99789xjbip0cn3zfsg874zaazbmicbpllxzh62.

hash mismatch in fixed-output derivation '/nix/store/vmqrfa7l1hg1lshaj15jz46li5v8r2qs-batsim-346e0de':
  wanted: sha256:00xyyr3fi8l6hb839bv3f7yb86yjv7xi1cgh1xnhipym4asvb4aq
  got:    sha256:0jacrinzzx6nxm99789xjbip0cn3zfsg874zaazbmicbpllxzh62
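
When the error message has the shape shown above, the computed hash can also be pulled out mechanically. The sketch below defines extract_got_hash, a hypothetical helper that reads the error text on standard input and prints the value that follows got:. It assumes the sha256: prefix printed above; other Nix versions may format this message differently.

```shell
# extract_got_hash is a hypothetical helper, not a Nix command.
# It reads Nix's error output on stdin and prints the hash from the "got:" line.
# Assumption: the line looks like "  got:    sha256:<hash>", as in the message above.
extract_got_hash() {
    sed -n 's/^[[:space:]]*got:[[:space:]]*sha256:\([0-9a-z]*\).*$/\1/p'
}
```

For instance, nix-shell ./env-pin-everything.nix --run true 2>&1 | extract_got_hash would print only the computed hash when a mismatch occurs. Alternatively, the nix-prefetch-git tool from nixpkgs can compute the hash for a given url and rev up front.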

Setting up a full experiment

Note

A rendering of this experiment notebook is hosted there.

This example experiment shows how Nix can help in designing an experiment with a reproducible software environment. Its goal is to evaluate whether the introduction of smart pointers reduced Batsim’s memory usage over time.

Environments

env-check-memuse-improvement.nix

#
# WARNING: do NOT use old kapack for new work!
# make sure you use NUR-kapack instead!
#
{ old_kapack ? import
  ( fetchTarball "https://github.com/oar-team/kapack/archive/773d3909d78f1043ffb589a725773699210d71d5.tar.gz") {}
, kapack ? import
  ( fetchTarball "https://github.com/oar-team/nur-kapack/archive/1672831224a21d6c34350d8f78cff9266e3e28a2.tar.gz") {}
}:

with kapack.pkgs;

let
  self = rec {
    # an old version of Batsim, before the introduction of smart pointers.
    old_batsim = old_kapack.batsim_dev.overrideAttrs (attr: rec {
      name = "batsim-3.1.0-346e0de";
      src = fetchgit {
        url = "https://framagit.org/batsim/batsim.git";
        rev = "346e0de311c10270d9846d8ea418096afff32305";
        sha256 = "0jacrinzzx6nxm99789xjbip0cn3zfsg874zaazbmicbpllxzh62";
      };
      mesonBuildType = "release";
      hardeningDisable = [];
      dontStrip = false;
    });
    # a more recent Batsim, after the introduction of smart pointers.
    batsim = (kapack.batsim-310.override{simgrid=kapack.simgrid-325;}).overrideAttrs (attr: rec {
      name = "batsim-3.1.0-5906dbe";
      src = fetchgit {
        url = "https://framagit.org/batsim/batsim.git";
        rev = "5906dbe67ba5c6229029e3ddcde5979ae116f287";
        sha256 = "08jwsgiz0s9n15pcv637sq31gyd3qzja850ycaz06kv59jlzcrfb";
      };
      mesonBuildType = "release";
      hardeningDisable = [];
      dontStrip = false;
    });
    # set of scheduling algorithms
    batsched = kapack.batsched-130.overrideAttrs (attr: rec {
      name = "batsched-1.3.0-dev";
      src = fetchgit {
        url = "https://framagit.org/batsim/batsched.git";
        rev = "54b18eb4f24bdb69617baa58b5b07842c70df094";
        sha256 = "08mw03k18m4ppschrcmyali33im4hz8060j990aa673vpdlf0pb2";
      };
    });

    # r tools around Batsim
    battools_r = kapack.pkgs.rPackages.buildRPackage {
      name = "battools-r-fcccf8a";
      src = fetchgit {
        url = "https://framagit.org/batsim/battools.git";
        rev = "fcccf8a6bccae388af6a17b866bba6c11097734f";
        sha256 = "05vll6rhdiyg38in8yl0nc1353fz2j7vqpax64czbzzhwm5d5kfs";
      };
      propagatedBuildInputs = with kapack.pkgs.rPackages; [
        dplyr
        readr
        magrittr
        assertthat
      ];
    };

    # a python tool to transform massif traces to exploitable data
    massif_to_csv = kapack.pkgs.python3Packages.buildPythonPackage rec {
      pname = "massif_to_csv";
      version = "0.1.0";
      propagatedBuildInputs = [msparser];
      src = builtins.fetchurl {
        url = "https://files.pythonhosted.org/packages/09/2d/674c3405939f198e963ba5e73c2a331ef3364bc52da9b123c8f16dd60c8d/massif_to_csv-0.1.0.tar.gz";
        sha256 = "f5eb01dce6d2e4a6c9812fd58f0add20b6739ba340482b7902f311298eb37dfb";
      };
    };

    # a massif_to_csv dependency
    msparser = kapack.pkgs.python3Packages.buildPythonPackage rec {
      pname = "msparser";
      version = "1.4";
      buildInputs = [
        kapack.pkgs.python3Packages.pytest
      ];
      src = builtins.fetchurl {
        url = "https://files.pythonhosted.org/packages/e0/68/aece1c5e75b49d95f304d2df029ae69583ef59a55694ec683e2452d70637/msparser-1.4.tar.gz";
        sha256 = "1199d27bdc492647d2d17d7776e49176f3ec3d2d959d4cfc8b2ce9257cefc16f";
      };
    };

    # dependencies in common for the two experimental environments
    common_expe_deps = [
      kapack.batexpe
      batsched
      valgrind
    ];

    # environment used to generate simulation inputs.
    input_preparation_env = mkShell rec {
      name = "input-preparation-env";
      buildInputs = [
        # to generate robin instances
        kapack.batexpe
        # to generate a batsim workload
        kapack.pkgs.R
        battools_r
        # to download a batsim platform
        curl
      ];
    };
    # environment used to execute simulations with the old Batsim version.
    simulation_old_env = mkShell rec {
      name = "old-env";
      buildInputs = [old_batsim] ++ common_expe_deps;
    };
    # environment used to execute simulations with the recent Batsim version.
    simulation_env = mkShell rec {
      name = "env";
      buildInputs = [batsim] ++ common_expe_deps;
    };
    # environment used to analyze the results, and to render them into a html document.
    notebook_env = mkShell rec {
      name = "notebook-env";
      buildInputs = [
        # Tools to analyse results.
        kapack.pkgs.R
        kapack.pkgs.rPackages.tidyverse
        kapack.pkgs.rPackages.viridis
        massif_to_csv
        # Rmarkdown-related tools (Rmarkdown is a notebook technology).
        kapack.pkgs.rPackages.knitr
        kapack.pkgs.rPackages.rmarkdown
        kapack.pkgs.pandoc
      ];
    };
  };
in
  self

The env-check-memuse-improvement.nix file describes several environments; comments describe the role of each one. Separate simulation environments are needed here because we want to evaluate two versions of the same software (batsim), and putting two versions of the same package in a single environment would create a collision.
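
Each environment is selected by its attribute name with the -A option of nix-shell, as the notebook below does. As a sketch, the repeated invocation can be wrapped in a small function; run_in_env and the NIX_SHELL override are hypothetical names introduced for illustration only.

```shell
# run_in_env is a hypothetical wrapper, not part of Nix or Batsim.
# Usage: run_in_env ENV_NAME COMMAND [ARGS...]
# It runs the command in the environment ENV_NAME defined in
# env-check-memuse-improvement.nix. Setting NIX_SHELL=echo prints the
# nix-shell arguments instead of running anything (a dry run).
run_in_env() {
    env_name=$1
    shift
    "${NIX_SHELL:-nix-shell}" env-check-memuse-improvement.nix \
        -A "$env_name" --command "$*"
}
```

With this helper, run_in_env simulation_env robin ./expe/new.yaml stands for the corresponding nix-shell call, and prefixing it with NIX_SHELL=echo shows the arguments that would be passed.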

Scripts

As would (and should) be done in a real experiment, some scripts are used here so that some steps are kept independent from each other and from the main engine used to run the experiment. Here, an R Markdown notebook is used as the main engine to execute the simulations, and to analyze and present the results.

The scripts used here have the following content.

generate-workload.R

#!/usr/bin/env Rscript
library(battools)
library(dplyr)

# Read a SWF workload.
kth_sp2 = read_swf("http://www.cs.huji.ac.il/labs/parallel/workload/l_kth_sp2/KTH-SP2-1996-2.1-cln.swf.gz")

# Only work on a short period (3 * 30 days, as computed below) from an arbitrary time point.
date_begin = mean(kth_sp2$submit_time)
date_end = date_begin + 60*60*24*30*3
month = kth_sp2 %>% filter(submit_time >= date_begin) %>% filter(submit_time < date_end)

# Generate a Batsim workload.
workload = swf_to_batworkload_delay(month, 100, subtime_strat = "translate_to_zero")
write_batworkload(workload, "./kth_month.json")

prepare-instances.bash

#!/usr/bin/env bash
set -eu

# start from a clean directory structure
EXPE_DIR=$(realpath ./expe)
rm -rf "${EXPE_DIR}"

# create instances' directories
mkdir -p "${EXPE_DIR}/old"
mkdir -p "${EXPE_DIR}/new"

# generate a robin file for each instance
BATCMD_BASE="batsim -p '${EXPE_DIR}/cluster.xml' -w '${EXPE_DIR}/kth_month.json' --mmax-workload"
robin generate "${EXPE_DIR}/old.yaml" \
      --output-dir "${EXPE_DIR}/old" \
      --batcmd "valgrind --tool=massif --time-unit=ms --massif-out-file='${EXPE_DIR}/old/massif.out' ${BATCMD_BASE} -e '${EXPE_DIR}/old/out_'" \
      --schedcmd "batsched -v easy_bf_fast"

robin generate "${EXPE_DIR}/new.yaml" \
      --output-dir "${EXPE_DIR}/new" \
      --batcmd "valgrind --tool=massif --time-unit=ms --massif-out-file='${EXPE_DIR}/new/massif.out' ${BATCMD_BASE} -e '${EXPE_DIR}/new/out_'" \
      --schedcmd "batsched -v easy_bf_fast"

run-notebook.R

#!/usr/bin/env Rscript
rmarkdown::render('notebook.Rmd')

Notebook

Finally, here is the notebook source:

notebook.Rmd

Batsim: Impact of smart pointers on Batsim's memory usage
=========================================================

This notebook is an example of a repeatable experiment from [one of Batsim's documentation tutorials](https://batsim.readthedocs.io/en/latest/tuto-reproducible-experiment/tuto.html).

Simulation instances preparation
--------------------------------

Here, we want to run two simulations with the same inputs but with a different Batsim version.
This can be done by executing the `prepare-instances.bash` script in its dedicated environment:

```{bash}
nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command './prepare-instances.bash'
```

This creates the following files:
```{bash}
tree ./expe
```

Getting simulation inputs
-------------------------

Here we will simulate the old [KTH SP2 workload](https://www.cse.huji.ac.il/labs/parallel/workload/l_kth_sp2/index.html) from the parallel workloads archive.
The `generate-workload.R` script downloads the raw logs, extracts a month in the middle of the trace, then generates a Batsim workload from it. It is called in the input preparation environment:

```{bash, results="hide"}
nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command '(cd ./expe && ../generate-workload.R)'
```

We will use a platform with enough resources from the Batsim repository.
Platform characteristics do not matter much here, as we use delay profiles that are not sensitive to the jobs' execution context.

```{bash}
nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command 'curl -k -o ./expe/cluster.xml https://framagit.org/batsim/batsim/raw/346e0de311c10270d9846d8ea418096afff32305/platforms/cluster512.xml'
```

Running simulations
-------------------

This is done by executing robin on the instance files in their dedicated environments.
Please note that separating these two environments is mandatory, as a different Batsim version is defined in each environment.

```{bash}
nix-shell env-check-memuse-improvement.nix -A simulation_env --command 'robin ./expe/new.yaml'
nix-shell env-check-memuse-improvement.nix -A simulation_old_env --command 'robin ./expe/old.yaml'
```

Analyzing results
-----------------

First, we can visually check that the simulation results are similar.

```{r, message=FALSE, fig.width=10, fig.height=6}
library(tidyverse)
library(viridis)
theme_set(theme_bw())

# batsim-generated summaries
old_schedule = read_csv('./expe/old/out_schedule.csv') %>% mutate(instance='old')
new_schedule = read_csv('./expe/new/out_schedule.csv') %>% mutate(instance='new')
schedules = bind_rows(old_schedule, new_schedule)
schedules %>% tbl_df %>% rmarkdown::paged_table()

# jobs data
old_jobs = read_csv('./expe/old/out_jobs.csv') %>% mutate(instance='old')
new_jobs = read_csv('./expe/new/out_jobs.csv') %>% mutate(instance='new')
jobs = bind_rows(old_jobs, new_jobs) %>% mutate(color_id=job_id%%5)

jobs_plottable = jobs %>%
    mutate(starting_time = starting_time / (60*60*24),
           finish_time = finish_time / (60*60*24)) %>%
    separate_rows(allocated_resources, sep=" ") %>%
    separate(allocated_resources, into = c("psetmin", "psetmax"), fill="right") %>%
    mutate(psetmax = as.integer(psetmax), psetmin = as.integer(psetmin)) %>%
    mutate(psetmax = ifelse(is.na(psetmax), psetmin, psetmax))

jobs_plottable %>%
    ggplot(aes(xmin=starting_time,
               ymin=psetmin,
               ymax=psetmax + 0.9,
               xmax=finish_time,
               fill=color_id)) +
    geom_rect(alpha=0.9, color="black", size=0.1, show.legend = FALSE) +
    scale_fill_viridis() +
    facet_wrap(~instance, ncol=1) +
    labs(x='Simulation time (day)', y="Resources") +
    ggsave('./gantts.pdf', width=15, height=9)
```

Aggregated metrics are the same and the Gantt charts look similar.

Let us now take a look at Batsim's memory footprint over time for both runs.
```{bash}
massif-to-csv ./expe/old/massif.out{,.csv}
massif-to-csv ./expe/new/massif.out{,.csv}
```

```{r, message=FALSE, fig.width=10, fig.height=6}
old_massif = read_csv('./expe/old/massif.out.csv') %>% mutate(instance='old')
new_massif = read_csv('./expe/new/massif.out.csv') %>% mutate(instance='new')
massif = bind_rows(old_massif, new_massif) %>% mutate(
    total=(stack+heap+heap_extra) / 1e6,
    time=time/1e3)

massif %>%
    ggplot(aes(x=time, y=total)) +
    geom_step() +
    facet_wrap(~instance, ncol=1) +
    labs(x='Real time (s)', y="Batsim process's memory consumption (MB)") +
    ggsave('./memuse_over_time.png', width=15, height=9)
```

Well okay, the memory usage pattern did not change much but the overall performance improved a lot.

![](https://i.kym-cdn.com/photos/images/newsfeed/000/114/139/tumblr_lgedv2Vtt21qf4x93o1_40020110725-22047-38imqt.jpg)

The notebook can be run with the following command, which runs the simulations and analyses from the notebook.

nix-shell env-check-memuse-improvement.nix -A notebook_env --command ./run-notebook.R
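
Once the notebook has run, a quick sanity check is that each instance directory contains the files the analysis reads. The sketch below checks for out_schedule.csv, out_jobs.csv and massif.out, the file names used in the notebook above; check_instance_outputs is a hypothetical helper name.

```shell
# check_instance_outputs is a hypothetical helper, not part of Batsim or robin.
# It checks that an instance directory (e.g. ./expe/old) contains the files
# read by the notebook, and returns a non-zero status if any is missing.
check_instance_outputs() {
    dir=$1
    missing=0
    for f in out_schedule.csv out_jobs.csv massif.out; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_instance_outputs ./expe/old && check_instance_outputs ./expe/new prints nothing when both runs produced their outputs.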