Changelog

All notable changes to this project will be documented in this file. The format is based on Keep a Changelog. Starting with version v1.0.0, Batsim adheres to Semantic Versioning and its public API includes the following.

  • The Batsim command-line interface.

  • The format of the Batsim input files.

  • The communication protocol with the decision-making component.


v5.0.0

  • Commits since v4.0.0

  • nix-env -f https://framagit.org/batsim/batsim/-/archive/main/batsim-main.tar.gz?ref_type=heads -iA packages.x86_64-linux.batsim

Note

This version stems from v4.0.0. All changes introduced between v4.0.0 and v4.2.1 are also included in v5.0.0, alongside the many changes that v5.0.0 itself introduces.

Architectural and protocol changes (big breaks)

  • The Decision Process concept (a system process external to Batsim that takes decisions) has been replaced with the External Decision Component concept. In short, this is a generalization that enables the use of either external decision processes (as before) or external decision libraries that must respect a given C API. Rationale.

    • This should improve simulation performance, as decision components can be called many times during a simulation, and calling a function is much cheaper than doing process-to-process communication.

    • This enables the use of many tools that focus on single-process applications, such as performance analyzers that will greatly help us optimize Batsim, or advanced debuggers such as rr.

    • This should be an incentive to waste less energy in simulation campaigns, as taking advantage of multicore machines should be easier now (environmental side effects are much easier to manage without sockets or Redis).

  • The protocol message format has been changed from custom JSON to flatbuffers. This means messages can now be sent in binary or in JSON (but the JSON format has changed). Rationale. Messages are now typed, which we think will be easier to maintain. The definition of the protocol, as well as helper de/serialization libraries around it, are now packaged in the batprotocol git repository. Helper libraries are partly generated from a protocol description file, which will help in making sure they remain compatible with each other (without forcing all implementations to support all features). This separation should help maintainability, as protocol updates can be kept consistent among several implementations much more easily than before.

Command-line interface changes (breaks CLI unless stated explicitly)

  • Many changes were made with the introduction of the External Decision Components (EDCs) concept. The --socket-endpoint option has been replaced by the --edc-socket-str and --edc-socket-file options. The --edc-library-str and --edc-library-file options have been introduced to run EDCs as libraries. Several simulation feature options that could be set from Batsim’s command-line are now set from the protocol or directly in mandatory parameters when adding an EDC into the simulation. In other words, the following options have been removed: --forward-profiles-on-submission, --enable-dynamic-jobs, --acknowledge-dynamic-jobs, --enable-profile-reuse, --enable-compute-sharing, --disable-storage-sharing, --sched-cfg, --sched-cfg-file, --forward-unknown-events.

  • Batsim no longer uses docopt_cpp as its command-line parsing library, and now uses CLI11 instead. This was needed because enabling several EDCs at the same time requires the parsing of options with several arguments. Rationale. This should improve maintainability, as CLI11 is much more mature, more regularly maintained, and developed with a saner rationale (e.g., lots of tests).

  • The --add-role-to-hosts option has been replaced by the --add-role option, which has a simpler syntax (the option should now be given once for each host).

  • The generation of most output files has been disabled by default. The new --trace-machine-state option replaces --disable-machine-state-tracing. The --trace-pstate-change option must now be set to generate the power state changes over time CSV file.

  • The --energy option now enables two SimGrid energy plugins (on hosts and links), whereas it only enabled the host plugin before. You can use the new --energy-host and --energy-link options if you only want to enable one of these two plugins.

  • New convenience feature (not a break). A configuration file can be used instead of stating all arguments on Batsim’s call. The --config option reads parameters from a configuration file. The --gen-config option enables the generation of configuration files.

  • New traceability feature (not a break). The --batsim-git-commit and --simgrid-git-commit options should now print, respectively, the Batsim and SimGrid commits that were used to build your final Batsim binary file.
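
As a migration aid, the option changes listed in this section can be checked mechanically. The following is a minimal sketch and not part of Batsim: the helper name find_obsolete_options and both tables are hypothetical, and they only transcribe the option names given in this section.

```python
# Options listed above as replaced, with the options that supersede them.
REPLACED = {
    "--socket-endpoint": ["--edc-socket-str", "--edc-socket-file"],
    "--add-role-to-hosts": ["--add-role"],
    "--disable-machine-state-tracing": ["--trace-machine-state"],
}

# Options listed above as removed (now set via the protocol or EDC parameters).
REMOVED = {
    "--forward-profiles-on-submission",
    "--enable-dynamic-jobs",
    "--acknowledge-dynamic-jobs",
    "--enable-profile-reuse",
    "--enable-compute-sharing",
    "--disable-storage-sharing",
    "--sched-cfg",
    "--sched-cfg-file",
    "--forward-unknown-events",
}


def find_obsolete_options(argv):
    """Return (option, hint) pairs for every pre-v5 option found in argv."""
    findings = []
    for arg in argv:
        name = arg.split("=", 1)[0]  # tolerate the --option=value spelling
        if name in REPLACED:
            findings.append((name, "replaced by " + " / ".join(REPLACED[name])))
        elif name in REMOVED:
            findings.append((name, "removed"))
    return findings


if __name__ == "__main__":
    # Hypothetical pre-v5 invocation; "<endpoint>" is a placeholder value.
    old_call = ["batsim", "--socket-endpoint", "<endpoint>", "--enable-dynamic-jobs"]
    for option, hint in find_obsolete_options(old_call):
        print(f"{option}: {hint}")
```

Such a check only flags option names. It does not translate argument values, since the syntax of the new options is not described in this changelog.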

Output file changes (breaks)

  • Default export prefix is now out/ instead of out, which means output files will be placed into the out directory by default now. _ is no longer added to Batsim’s export prefix. Batsim now recursively creates the export directory if needed.

  • Batsim no longer generates Pajé traces, and the --disable-schedule-tracing command-line option has been removed.

  • As stated in the command-line interface changes, the generation of many output files is now disabled by default and must be enabled through CLI options.

  • The schedule output file is now formatted as a JSON object instead of a CSV file.

  • A new real_exec_info output file (also JSON) is generated, aggregating information on the real execution time and memory usage.
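
The effect of the export prefix change is easiest to see on a concrete path. The following is a minimal sketch, assuming that output paths are formed by plain concatenation of the export prefix and a file name. The function names and the file names schedule.csv and schedule.json are illustrative and not taken from Batsim's code.

```python
import os
import tempfile


def old_output_path(prefix, name):
    """Pre-v5 behavior described above: '_' is appended to the export prefix."""
    return prefix + "_" + name


def new_output_path(prefix, name):
    """v5 behavior described above: the prefix is used as-is."""
    return prefix + name


def ensure_export_dir(prefix):
    """Recursively create the directory part of the prefix, if there is one."""
    directory = os.path.dirname(prefix)
    if directory:
        os.makedirs(directory, exist_ok=True)
    return directory


if __name__ == "__main__":
    print(old_output_path("out", "schedule.csv"))    # out_schedule.csv
    print(new_output_path("out/", "schedule.json"))  # out/schedule.json
    with tempfile.TemporaryDirectory() as tmp:
        created = ensure_export_dir(os.path.join(tmp, "a", "b", ""))
        print(os.path.isdir(created))  # True
```

Scripts that located output files by globbing out_* would therefore need to look inside the out directory instead.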

Other notable changes

  • Batsim now uses SimGrid 4.0 (see SimGrid’s framagit releases).

  • (break) Batsim now consistently uses the complete identifiers of jobs and profiles in the related protocol events (of the form job_id!workload_name or profile_name!workload_name).

  • (break) External events support has been simplified. For now only the external events of type generic are supported.

  • Changing the Pstate of a host or turning ON/OFF a host is now possible without enabling SimGrid’s host energy plugin.

  • Batsim’s tutorials have not yet been updated for this new version.
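
Decision components that used to handle bare job or profile names may need to split the complete identifiers mentioned above. The following is a minimal sketch, assuming only that the two parts are separated by the first "!" character. The helper split_complete_id is hypothetical (it is not part of Batsim or batprotocol) and returns the two parts in the order they appear, without naming them.

```python
SEPARATOR = "!"


def split_complete_id(complete_id):
    """Split a complete identifier into the parts before and after the first '!'."""
    left, sep, right = complete_id.partition(SEPARATOR)
    if not sep or not left or not right:
        raise ValueError(f"not a complete identifier: {complete_id!r}")
    return left, right


if __name__ == "__main__":
    # "12!w0" is a made-up identifier used for illustration only.
    print(split_complete_id("12!w0"))
```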

Added

  • Probes have been introduced but with limited support. One can only create periodic probes related to SimGrid’s host/link energy plugins.

Todo

talk about probes. flag --trace-probe-data

Removed (breaks)

  • Redis is no longer supported to carry meta-information about simulation events. All related CLI arguments no longer exist: --enable-redis, --redis-hostname, --redis-port, --redis-prefix. All related execution context keys no longer exist: redis_enabled, redis_hostname, redis_port, redis_prefix.

  • Machine permission checks have been removed (related to the compute-sharing and storage-sharing options). It is now the user/EDC’s responsibility to make sure a job is not executed on a “storage only” host.

  • Workflows are no longer supported and pugixml has been removed from the list of dependencies.


v4.2.1

This small release is a simple update of the documentation and the links to readthedocs, to prepare for the next major version of Batsim.

v4.2.0

Added

  • New Fractional Computation trace replay profile, that enables the replay of usage traces over time. This is especially helpful to replay applications from their power consumption traces.

Fixed

  • Using simgrid::s4u::Mailbox::put_async led to invalid memory management of simgrid::s4u::Comm objects. This sometimes resulted in segmentation faults, especially when using SimGrid 3.34.0. Batsim no longer calls put_async.

  • Batsim’s memory consumption increased over time due to lazy/bad management of ZeroMQ buffers, cf. issue 2 (framagit).


v4.1.0

Changed

  • Updated Batsim code / example platforms / platform generators so that they work with SimGrid-3.29.0.

Fixed

Miscellaneous

  • Improved readability of Batsim assertion error messages.

  • Improved documentation.


v4.0.0

Changed (breaks some schedulers)

  • Profiles and jobs are now cleaned from memory over time (instead of at the end of the whole simulation). This is done with a reference counting mechanism: when a job or profile is no longer needed according to what Batsim knows, it is removed from memory. This can break schedulers that rely on dynamic profile/job submission, especially when several proto_REGISTER_JOB events using the same profile are decided at different simulation times, as the profile can be garbage collected when its first execution finishes. The new --enable-profile-reuse Command-line Interface option should keep the previous behavior.
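
The hazard described above can be reproduced with a toy model. The following is a minimal sketch of reference-counted profiles and is not Batsim's implementation: the ProfileStore class, its method names and the reuse flag (which stands in for --enable-profile-reuse) are all hypothetical.

```python
class ProfileStore:
    """Toy model of reference-counted profiles (not Batsim's implementation)."""

    def __init__(self, reuse=False):
        self.reuse = reuse    # stand-in for --enable-profile-reuse
        self._profiles = {}   # profile name -> profile data
        self._users = {}      # profile name -> number of unfinished jobs using it

    def register_profile(self, name, data):
        self._profiles[name] = data
        self._users[name] = 0

    def register_job(self, profile_name):
        if profile_name not in self._profiles:
            raise KeyError(f"profile {profile_name!r} is unknown or already freed")
        self._users[profile_name] += 1

    def job_finished(self, profile_name):
        self._users[profile_name] -= 1
        if self._users[profile_name] == 0 and not self.reuse:
            # The last job using the profile is done: free it immediately.
            del self._profiles[profile_name]
            del self._users[profile_name]

    def __contains__(self, name):
        return name in self._profiles


if __name__ == "__main__":
    store = ProfileStore()
    store.register_profile("p", {"type": "delay"})
    store.register_job("p")   # first job, registered early
    store.job_finished("p")   # its execution finishes: "p" is freed
    try:
        store.register_job("p")  # second job, decided at a later simulation time
    except KeyError as error:
        print("late registration failed:", error)
```

In this model, a scheduler avoids the failure either by registering all jobs that share a profile before the first one finishes, or by running with reuse=True.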

Removed (breaks CLI)

Added

  • Scheduler configuration can be given to Batsim (via --sched-cfg or --sched-cfg-file Command-line Interface options). This configuration string is forwarded to the scheduler in the proto_SIMULATION_BEGINS event.

  • Basic tests for the external events mechanism.

  • Retrieval of the zone properties in the XML platform description.

    • Platform properties declared within SimGrid zones are now retrieved and attached to each Batsim resource.

    • These properties are forwarded to the scheduler via the field zone_properties of each resource in the compute_resources and storage_resources arrays of the proto_SIMULATION_BEGINS event.
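
On the scheduler side, the forwarded zone properties can be collected per resource. The following is a minimal sketch under stated assumptions: only the names compute_resources, storage_resources and zone_properties come from this changelog. The surrounding dictionary layout, the name key and the sample values are hypothetical.

```python
def zone_properties_by_resource(simulation_begins_data):
    """Map each resource name to its zone properties (empty dict if absent)."""
    result = {}
    for array_name in ("compute_resources", "storage_resources"):
        for resource in simulation_begins_data.get(array_name, []):
            result[resource["name"]] = resource.get("zone_properties", {})
    return result


if __name__ == "__main__":
    # Made-up event payload, for illustration only.
    data = {
        "compute_resources": [
            {"name": "node0", "zone_properties": {"rack": "r1"}},
        ],
        "storage_resources": [
            {"name": "pfs"},
        ],
    }
    print(zone_properties_by_resource(data))
```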

Fixed

  • Workflows crashed at the beginning and the end of the simulation. This should be fixed, and workflows are now tested under CI.

  • Killing jobs should no longer cause memory issues (invalid reads and writes), which caused segmentation faults in corner cases, cf. issue 37 (inria).

  • Killing sequences of delays should no longer crash with “Internal error” — cf. issue 108 (inria).

  • SMPI profiles should now be automatically killed when their walltime is reached — cf. issue 95 (inria).

Miscellaneous

  • Various performance improvements.

  • The jobs output file is now written over time (was only written on disk at the end of the simulation).

  • Batsim no longer uses SimGrid’s MSG interface. Everything is done with S4U now.

  • Smart pointers are used in most parts of the code (for reference counting memory deallocations).

  • Old markdown documentation has been removed.

  • Removal of CMake Find functions, pkgconfig is used instead.


v3.1.0

Changed

  • Batsim now requires that no proto_CALL_ME_LATER events are pending before it sends proto_SIMULATION_ENDS.

  • Workload identifiers are now generated depending on the order of the command-line arguments. Previously, they were hashes of the absolute filename of the workload, which was order independent.

Added

  • A new External Events mechanism has been added.

    • For the moment the following external events are supported.

      • machine_unavailable: Some machines are no longer available.

      • machine_available: Some machines are available again.

      • generic: User-defined external events that can be forwarded to the scheduler with the option --forward-unknown-events.

    • A new proto_NOTIFY protocol event no_more_external_event_to_occur has been added to tell the scheduler that no more external events coming from Batsim can occur during the simulation.

    • A new command-line option was added: --forward-unknown-events, which forwards unknown external events of the input files to the scheduler (ignored if there were no event inputs). The boolean value of this option is forwarded to the scheduler in the SIMULATION_BEGINS event.

Deprecated

  • Building via CMake is deprecated. Next Batsim versions may only support Meson.

Miscellaneous

  • Removed a build dependency to OpenSSL, which was only used to generate workload identifiers.

  • Batsim integration tests are now written with pytest instead of CMake.


v3.0.0

  • Commits since v2.0.0

  • Release date: 2019-01-15

  • nix-env -f https://github.com/oar-team/kapack/archive/master.tar.gz -i batsim-3.0.0

  • Recommended SimGrid commit: 97b4fd8e4

Changed (breaks protocol)

  • Removal of the NOP event.

  • SUBMIT_PROFILE has been renamed proto_REGISTER_PROFILE. Trying to register an already existing profile will now fail.

  • SUBMIT_JOB has been renamed proto_REGISTER_JOB. Trying to register an already existing job will now fail. The possibility to register profiles from within a proto_REGISTER_JOB event has been discarded. Now use proto_REGISTER_PROFILE then proto_REGISTER_JOB.

  • The proto_SIMULATION_BEGINS event has been changed:

    • The resources_data array has been split into the compute_resources and storage_resources arrays.

    • The content of the config object has been flattened and now contains the following keys: redis-enabled, redis-hostname, redis-port, redis-prefix, profiles-forwarded-on-submission, dynamic-jobs-enabled and dynamic-jobs-acknowledged.

  • The submission_finished proto_NOTIFY event has been renamed registration_finished.

  • The continue_submission proto_NOTIFY event has been renamed continue_registration.

Changed (breaks command-line interface)

  • Removal of the --config-file option. Everything should now be doable via the Batsim CLI.

  • Removal of the --enable-sg-process-tracing option. You can now use --sg-cfg to do the same.

  • --batexec has been renamed --no-sched.

  • --allow-time-sharing has been split into two options --enable-compute-sharing and --disable-storage-sharing, as resource roles have been introduced.

Changed (breaks workload format)

Changed (breaks platform format)

  • Batsim now uses SimGrid version 3.21 and therefore the SimGrid platform version 4.1, which broke things on how to define platforms. Please refer to SimGrid documentation for more information on this.

Changed (jobs/schedule output file format)

  • Breaks: The columns requested_number_of_processors and allocated_processors have been respectively renamed requested_number_of_resources and allocated_resources in the jobs output file.

  • Breaks: The order of the columns has changed in the jobs output file.

  • The columns final_state and profile have been added in the jobs output file.

  • The rejected jobs are now present in the jobs and the schedule output files.
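
Analysis scripts can absorb both the renamed columns and the new column order by reading the jobs file by column name. The following is a minimal sketch: the helper read_jobs_rows is hypothetical, the job_id column and the sample values are assumed for illustration, and only the two renamed column pairs come from this changelog.

```python
import csv
import io

# Old column name -> new column name, as listed above.
RENAMED_COLUMNS = {
    "requested_number_of_processors": "requested_number_of_resources",
    "allocated_processors": "allocated_resources",
}


def read_jobs_rows(csv_text):
    """Return rows keyed by the new column names, for old or new files alike."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {RENAMED_COLUMNS.get(key, key): value for key, value in row.items()}
        for row in reader
    ]


if __name__ == "__main__":
    # Made-up file content using the old column names.
    old_file = "job_id,requested_number_of_processors,allocated_processors\n1,4,0-3\n"
    print(read_jobs_rows(old_file))
```

Because rows are accessed by name rather than by position, the change in column order does not affect such a reader.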

Changed (new dependencies)

  • docopt-cpp and pugixml are now external dependencies and no longer provided with Batsim sources.

  • New intervalset dependency, which replaces the previous MachineRange class.

  • batexpe is now an optional dependency to test batsim.

Added (protocol)

  • Addition of the no_more_static_job_to_submit proto_NOTIFY event, which is sent by Batsim when all the jobs described in the static workloads/workflows have been submitted.

  • Addition of the profiles object in the proto_SIMULATION_BEGINS event. The key is the workload_id and the value is the list of profiles of that workload.

  • Addition of the optional storage_mapping object in the proto_EXECUTE_JOB event, which allows defining which resource id should be used for a named IO resource.

  • Addition of the optional additional_io_job object in the proto_EXECUTE_JOB event, which allows adding IO movements to a job execution. This is done by merging a traditional parallel task (within the allocated hosts that compute the job) with another parallel task that defines IO movements (within the allocated hosts that compute the job, but also potentially with IO resources).

Added (platform format)

  • Roles can now be specified for the hosts of a platform. This is done by setting the role XML property of a host. A default master host can be specified this way by using the master role value. The storage value is for hosts that describe storage resources; such hosts are allowed to send and receive bytes but not to compute. The compute_node value (used by default if no role is specified) is for hosts that describe computing resources that can both compute and communicate. More information in Roles of hosts.
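
The role lookup described above, including the compute_node default, can be sketched as follows. This is an illustration only and not Batsim's parsing code: the platform fragment assumes SimGrid's prop element syntax (id and value attributes), and the host names and speeds are made up.

```python
import xml.etree.ElementTree as ET

DEFAULT_ROLE = "compute_node"  # used when a host declares no role

# Hypothetical platform fragment, for illustration only.
PLATFORM_XML = """
<platform version="4.1">
  <zone id="z" routing="Full">
    <host id="master_host" speed="1Gf"><prop id="role" value="master"/></host>
    <host id="pfs" speed="1Gf"><prop id="role" value="storage"/></host>
    <host id="node0" speed="1Gf"/>
  </zone>
</platform>
"""


def host_roles(xml_text):
    """Map each host id to its role property, falling back to the default."""
    root = ET.fromstring(xml_text)
    roles = {}
    for host in root.iter("host"):
        role = DEFAULT_ROLE
        for prop in host.findall("prop"):
            if prop.get("id") == "role":
                role = prop.get("value")
        roles[host.get("id")] = role
    return roles


if __name__ == "__main__":
    print(host_roles(PLATFORM_XML))
```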

Added (command-line interface)

  • New --add-role-to-hosts option, which allows adding a role to some hosts.

  • New --sg-cfg option, which allows setting SimGrid configuration options.

  • New --sg-log option, which allows setting SimGrid logging options.

  • New --dump-execution-context option, that dumps the command execution context on the standard output. This allows external tools to understand the execution context of a Batsim command without actually parsing it.

Known issues

  • Killing jobs may now crash in some (corner-case) situations. This happens since Batsim upgraded its SimGrid version. Tracked on issue 37 (inria).

  • SMPI profiles only handle relative trace filenames. Tracked on issue 97 (inria).

  • Batsim does not check job size correctly when executed with --no-sched. Tracked on issue 70 (inria).

Miscellaneous

  • Various bug fixes.

  • Removed the python experiment scripts that were located in tools/experiments, as robin became the standard tool to execute Batsim experiments.

  • Removed git submodules. Please now use schedulers directly from their repositories or from kapack.

  • Removed dependencies to GMP and cppzmq.

  • Batsim now mainly uses the s4u SimGrid interface. If you used to set SimGrid configuration/logging options through the Batsim CLI, the names of such options have therefore probably changed.

  • Documentation moved to readthedocs.

  • The workload_profiles directory has been renamed workloads.

  • New generator for heterogeneous platforms (code and documentation in platforms/heterogeneous).

  • New demo (in demo/).


v2.0.0

  • Commits since v1.4.0

  • Release date: 2018-02-20

  • nix-env -f https://github.com/oar-team/kapack/archive/master.tar.gz -i batsim-2.0.0

  • Recommended SimGrid commit: 587483ebe

Changed (breaks protocol)

  • The QUERY_REQUEST and QUERY_REPLY messages have been respectively renamed QUERY and ANSWER. This pair of messages is now bidirectional (Batsim can now request information from the scheduler). Redis interactions with this pair of messages are no longer in the protocol (as they have never been implemented).

  • When submitting dynamic jobs (SUBMIT_JOB), the job_id and id fields should now have the same value. Furthermore, job ids are no longer integers but strings: my_wload!hello readers is now a valid job identifier.

  • Removal of the job_status field from JOB_COMPLETED messages.

  • JOB_COMPLETED messages should now be sent even for killed jobs. In this case, JOB_COMPLETED should be sent before JOB_KILLED.

Added

  • Added the --simgrid-version command-line option to show which SimGrid is used by Batsim.

  • Added the --unittest command-line option to run unit tests. Executed by Batsim’s continuous integration system.

  • New SET_JOB_METADATA protocol message, which allows setting metadata on jobs. Such metadata is written in the _jobs.csv output file.

  • The _schedule.csv output file now contains a batsim_version field.

  • Added the estimate_waiting_time QUERY from Batsim to the scheduler.

  • The proto_SIMULATION_BEGINS message now contains information about workloads: A map from workload identifiers to their filenames.

  • Added the job_alloc field to JOB_COMPLETED messages, which mentions which machines have been allocated to the finished job.

Changed

  • The _jobs.csv output file is now written more cleanly. The order of the columns within it may have changed. Removal of the deprecated hacky_job_id field.

Fixed

  • Numeric sort should now work as expected (this is now tested).

  • Power tracing now works when the number of machines is big.

  • Output buffers now work even if incoming texts are bigger than the buffer.

  • The QUERY_REQUEST/QUERY_REPLY messages were not respecting the protocol definition (probably never tested since the JSON protocol update).

  • Dynamically submitted jobs could not be used right away after being submitted (by the following events, or at least the events of the same timestamp). This should now be possible.


v1.4.0

  • Commits since v1.3.0

  • Release date: 2017-10-07

  • nix-env -f https://github.com/oar-team/kapack/archive/master.tar.gz -i batsim-1.4.0

  • Recommended SimGrid commit: 587483ebe

Added

  • New SUBMIT_PROFILE protocol message that allows the decision process to submit profiles dynamically.

  • New msg_par_hg_tot profile type. This is a homogeneous parallel task whose computation and communication amounts are spread over all allocated nodes. Such tasks can be seen as optimistic moldable tasks.


v1.3.0

Added

  • Jobs walltimes are no longer mandatory. The walltime field of jobs can now be omitted or set to -1. Such jobs will never be killed automatically by Batsim.
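
The two ways of expressing "no walltime" can be normalized in one place. The following is a minimal sketch: the helper effective_walltime is hypothetical, and only the walltime field name and the -1 value come from this changelog.

```python
def effective_walltime(job):
    """Return the job's walltime, or None when the job has no walltime."""
    walltime = job.get("walltime", -1)  # an omitted field behaves like -1
    return None if walltime == -1 else walltime


if __name__ == "__main__":
    # Made-up job descriptions, for illustration only.
    for job in ({"id": "a", "walltime": 100}, {"id": "b", "walltime": -1}, {"id": "c"}):
        print(job["id"], effective_walltime(job))
```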


v1.2.0

Added

  • The job progress is now sent through the protocol when jobs are killed on request. This is done via a new job_progress map in JOB_KILLED messages, which gives this information for all the jobs that have really been killed.

  • New job state COMPLETED_WALLTIME_REACHED (separated from COMPLETED_FAILED).


v1.1.0

Added

  • New job profiles SCHEDULER_SEND and SCHEDULER_RECV that communicate with the scheduler. New send and recv protocol events that correspond to them.

  • Jobs now have a return code. Can be specified in the ret field of the jobs in their JSON description. Default value is 0 (success).

  • New job state: COMPLETED_FAILED.

  • New data added to the JOB_COMPLETED protocol event. return_code indicates whether the job has succeeded. The FAILED status can now be received.

Changed

  • The repeat value of sequence (composed) profiles is now optional. Default value is 1 (executed once, no repeat).
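
The two defaults introduced in this release (ret for jobs, repeat for sequence profiles) can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the seq key used to hold the sub-profile names is an assumption, since this changelog only names the ret and repeat fields.

```python
def job_return_code(job):
    """Return the job's return code; 0 (success) when 'ret' is omitted."""
    return job.get("ret", 0)


def sequence_repeat(profile):
    """Return how many times a sequence is executed; 1 when 'repeat' is omitted."""
    return profile.get("repeat", 1)


def expanded_sequence(profile):
    """List the sub-profiles in execution order ('seq' is an assumed key)."""
    return list(profile["seq"]) * sequence_repeat(profile)


if __name__ == "__main__":
    # Made-up descriptions, for illustration only.
    print(job_return_code({"id": "j1"}))
    print(expanded_sequence({"seq": ["a", "b"], "repeat": 2}))
```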


v1.0.0

Added

  • Stated LGPL-3.0 license.

  • Code cosmetics standards are now checked by Codacy.

  • New PFS host. Associated with a new hpst-host command-line option.

  • New protocol event CHANGE_JOB_STATE. It allows the scheduler to change the state of jobs in Batsim in-memory data structures.

  • The submission_finished notification can be canceled with a continue_submission notification.

  • New data to the proto_SIMULATION_BEGINS protocol event. allow_time_sharing boolean is now forwarded. resources_data gives information on the resources. hpst_host and lcst_host give information about the parallel file system.

  • New data to the JOB_COMPLETED protocol event. job_state contains the job state (as stored by Batsim). kill_reason contains why the job has been killed (if relevant).

  • New continue_submission proto_NOTIFY event, which cancels a previous submission_finished proto_NOTIFY event.

Modified

  • Improved and renamed parallel file system profiles.

  • Improved code documentation.

  • Improved the python scripts of the tools/ directory.

  • Improved the python scripts of the test/ directory.

Fixed

  • Complex allocation mappings were not handled correctly.


v0.99

  • Release date: 2017-05-26

Changed

  • The protocol is based on ZeroMQ instead of Unix Domain Sockets.

  • The protocol messages are now formatted in JSON (was custom text).