Computational resources in Boulder Opal
Understanding the Boulder Opal computing environment and utilizing the compute resources effectively
Based on your Boulder Opal plan, you are allocated machine hours and the maximum number of machines you can spin up to run concurrent calculations. For example, with the Performance plan, you can spin up to 4 machines simultaneously, which corresponds to 4 concurrent calculations. With the Professional plan, you can spin up to 16 machines simultaneously, which corresponds to 16 concurrent calculations. The total number of machine hours consumed is the number of hours aggregated across all machines running in your environment.
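As a rough illustration of how machine hours add up, the sketch below sums the running time of each machine. The function name and the numbers are hypothetical and chosen only for this example; they are not part of the Boulder Opal API or of any billing formula beyond the aggregation described above.

```python
def total_machine_hours(runtime_hours_per_machine):
    """Sum the running time, in hours, across all machines in the environment."""
    return sum(runtime_hours_per_machine)


# Illustrative numbers: four machines running concurrently for 1.5 hours each.
usage = total_machine_hours([1.5, 1.5, 1.5, 1.5])
print(usage)  # 6.0
```

In other words, running four machines side by side for 1.5 hours of wall-clock time consumes 6 machine hours, not 1.5.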
In this topic, we will cover how parallel computations are treated and how you can manage your Boulder Opal queue and resources.
Parallel computations
When you submit a calculation request in Boulder Opal, depending on the type of calculation, it will result in one or multiple tasks being added to your organization's queue. Boulder Opal calculations utilize multiprocessing, so each calculation runs as quickly as possible on a single worker. By design, only one calculation runs at a time on a given worker machine, to avoid memory issues. If your Boulder Opal plan allows for concurrent calculations and enough worker machines are online, the tasks will run in parallel. Visit the Boulder Opal web app to monitor the status of your organization's queue.
Collecting requests
Most functions in Boulder Opal (for example, boulderopal.execute_graph) can be viewed as a single task execution request to be scheduled for running on a remote machine.
However, you can use boulderopal.cloud.group_requests to bundle multiple requests together and submit them at once.
Key considerations when using boulderopal.cloud.group_requests:
- Up to five requests can be grouped together. If more are submitted, a runtime error will be raised.
- The actual number of tasks that will run in parallel depends on your Boulder Opal plan and the number of available machines in your environment. You can use boulderopal.cloud.request_machines to spin up the machines before submitting the tasks.
- The context manager blocks execution until the results of all submitted requests are available.
- All the requests within the context manager must be independent of one another. That is, a request cannot depend on another request's result. Note that this also applies to a single function that may issue multiple requests. For example, boulderopal.closed_loop.optimize results in multiple requests where each step relies on the previous step's result, and may trigger a runtime error if used with boulderopal.cloud.group_requests.
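One way to respect the five-request limit when you have more independent requests is to split them into groups and open one context per group. The sketch below is only an illustration: split_into_groups and submit_in_groups are hypothetical helpers, each request is represented by a zero-argument callable, and contextlib.nullcontext stands in for the real context manager so that the example runs without a Boulder Opal session. It assumes that boulderopal.cloud.group_requests can be called without arguments to create the context, which you should check against the reference documentation.

```python
from contextlib import nullcontext

MAX_GROUP_SIZE = 5  # limit on grouped requests stated in the list above


def split_into_groups(requests, group_size=MAX_GROUP_SIZE):
    """Split a list of requests into consecutive groups of at most group_size."""
    return [requests[i:i + group_size] for i in range(0, len(requests), group_size)]


def submit_in_groups(requests, group_context):
    """Issue independent requests, one context per group of at most five.

    requests: zero-argument callables, each issuing a single independent request.
    group_context: a callable returning a context manager. In a real session this
        would presumably be bo.cloud.group_requests (an assumption, not verified here).
    """
    results = []
    for group in split_into_groups(requests):
        with group_context():
            group_results = [request() for request in group]
        # Results are only collected after the context has exited, because the
        # context manager blocks until every result in the group is available.
        results.extend(group_results)
    return results


# Stand-in demonstration: twelve trivial "requests" and a no-op context.
demo_requests = [lambda i=i: i * i for i in range(12)]
demo_results = submit_in_groups(demo_requests, nullcontext)
print([len(group) for group in split_into_groups(demo_requests)])  # [5, 5, 2]
```

Because each context blocks until its group has finished, the groups run one after another; only the requests inside a group can run in parallel, and only if enough machines are online.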
Managing the queue
In certain scenarios, your calculation may remain in the queued state for an unusually long time (for example, longer than 20 minutes). Long queue times are primarily caused by:
- Upscaling worker machines.
- Provisioning additional virtual machine instances.
- All allowed machines for the plan are busy running calculations.
Upscaling worker machines
If there are outstanding queued tasks, your Boulder Opal environment will upscale by spinning up new worker machines.
A machine will spin up when a task is in the queue for more than 30 seconds.
For example, in the case of a boulderopal.run_optimization calculation with optimization_count=10, the first worker machine would pick up the first task.
Then, every 30 seconds, if there are still outstanding tasks in the queue, a new worker machine will spin up to handle them.
If each optimization takes about five minutes to run, the system would have spun up 10 worker machines by the end.
This is because at every 30-second interval there would still be a task outstanding in the queue.
Note that, in this scenario, it would take longer for the calculation to complete than if there were already 10 workers spun up and ready.
To reduce wait time you can pre-provision the worker machines.
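The upscaling scenario above can be reproduced with a small timeline model. This is a simplified sketch, not a description of the actual scheduler: simulate_upscaling and its parameters are hypothetical, the first machine is assumed to be online at time zero, the time a machine needs to boot is ignored, and a new machine is assumed to appear exactly every 30 seconds while tasks are waiting.

```python
import heapq


def simulate_upscaling(task_count, task_seconds, interval_seconds=30, max_machines=None):
    """Return (machines_used, seconds_until_last_task_finishes) for the toy model.

    Assumptions (not taken from the scheduler itself): one machine is online at
    t=0, machine boot time is zero, and every task takes task_seconds to run.
    """
    if task_count <= 0:
        return 0, 0
    busy_until = [task_seconds]  # min-heap of the times at which machines become free
    queued = task_count - 1      # the first machine takes the first task at t=0
    makespan = task_seconds
    next_tick = interval_seconds
    while queued > 0:
        at_cap = max_machines is not None and len(busy_until) >= max_machines
        if at_cap or busy_until[0] <= next_tick:
            # An existing machine frees up first (or no new machine is allowed),
            # so it takes the next queued task.
            start = heapq.heappop(busy_until)
        else:
            # A task is still waiting at the upscaling tick: a new machine takes it.
            start = next_tick
            next_tick += interval_seconds
        end = start + task_seconds
        heapq.heappush(busy_until, end)
        queued -= 1
        makespan = max(makespan, end)
    return len(busy_until), makespan


# Ten optimizations of about five minutes (300 seconds) each.
print(simulate_upscaling(10, 300))  # (10, 570)
```

Under these assumptions the tenth task only starts at 270 seconds, so the whole calculation finishes after 570 seconds, compared with about 300 seconds if all 10 machines had been online from the start. Real spin-up times would widen that gap further.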
Pre-provisioning worker machines
Using the boulderopal.cloud.request_machines function, you can choose to avoid default upscaling in your environment.
With this method, you can request the number of worker machines that need to be online before you start processing your calculations.
import boulderopal as bo
bo.cloud.request_machines(machine_count)
where machine_count is the number of machines requested to be online (with the maximum set to the number of machines allocated in your plan).
For example,
bo.cloud.request_machines(4)
Waiting for 4 machines to be online...
Current environment: 0 machines online, 4 machines pending.
Current environment: 1 machine online, 3 machines pending.
Current environment: 2 machines online, 2 machines pending.
Current environment: 3 machines online, 1 machine pending.
Current environment: 4 machines online, 0 machines pending.
Requested machines (4) are online.
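When deciding what value to pass as machine_count, a reasonable starting point is the number of independent tasks you expect to run at once, capped at your plan's limit. The sketch below is a hypothetical helper, not part of Boulder Opal; the limits in the dictionary are the Performance and Professional figures quoted at the start of this topic, and other plans would need their own entries.

```python
# Maximum number of concurrent machines per plan, as quoted earlier in this topic.
PLAN_MACHINE_LIMITS = {"Performance": 4, "Professional": 16}


def machines_to_request(independent_task_count, plan):
    """Suggest a machine_count: one machine per independent task, capped by the plan."""
    limit = PLAN_MACHINE_LIMITS[plan]
    return min(max(independent_task_count, 0), limit)


print(machines_to_request(10, "Performance"))   # 4
print(machines_to_request(10, "Professional"))  # 10
```

Requesting more machines than you have independent tasks does not speed anything up, since each calculation runs on a single worker, but the extra machines still count towards your machine hours while they are online.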
Cancelling calculations and shutting down the environment
The Boulder Opal web app allows you to monitor the status of your calculations. You can also manually cancel individual running jobs to preserve resources, or completely shut down the computing environment (cancelling all running jobs).
Provisioning additional virtual machine (VM) instances
Boulder Opal worker machines run on Amazon Web Services (AWS) VM instances. If the underlying VM capacity has run out, a new VM instance has to be provisioned. This operation takes time, because the underlying AWS instance must be ready before the worker can be set up to execute user calculations.
All allowed machines for the plan are busy running calculations
You can also experience long queue times if you have multiple users running calculations at the same time but your plan does not support enough machines. If you are regularly experiencing this, you can upgrade to a higher plan that meets your needs.