Recording Clusters From The Genetic Algorithm

It is possible to record the types of clusters that are created during the genetic algorithm. In many cases, the user doesn’t want to record all the clusters that are created by the genetic algorithm, but instead record the more important ones, such as the lowest energetic clusters. This algorithm is even designed to prevent recording replicas of clusters that are the same with respect to the diversity scheme you have chosen.

All of the parameters for this components are gathered together in a dictionary, called ga_recording_information. This is passed into a class called the Recording_Cluster.

These parameters are:

ga_recording_scheme (str.):
- 'None': Do not record any clusters
- 'All': Record all clusters that are made (however, the highest energy clusters may still be removed if you give a input for limit_number_of_clusters_recorded or limit_size_of_database).
- 'Limit_energy_height': Here, all clusters will be recorded, expect for those that have an energy that is higher than \(Energy(Lowest Energy Cluster) + EnergyHeight\) (or if clusters need to be removed based your input for limit_number_of_clusters_recorded or limit_size_of_database. In this case, the highest energy clusters will be removed). This scheme also requires the user to also input into the ga_recording_information dictionary:
  'limit_energy_height_of_clusters_recorded': This is the value for \(EnergyHeight\).
- 'Set_higher_limit': All clusters will be recorded if that have an energy lower than 'Set_higher_limit' (or if clusters need to be removed based your input for limit_number_of_clusters_recorded or limit_size_of_database. In this case, the highest energy clusters will be removed). This scheme also requires the user to also input into the ga_recording_information dictionary:
  'upper_energy_limit': This is the upper energy limit. Only clusters with an energy lower than this will be recorded.
- 'Set_energy_limits': Clusters will be recorded if they have an energy between 'lower_energy_limit' and 'upper_energy_limit' (or if clusters need to be removed based your input for limit_number_of_clusters_recorded or limit_size_of_database. In this case, the highest energy clusters will be removed). This scheme also requires the user to also input into the ga_recording_information dictionary:
  'lower_energy_limit': This is the lower energy limit. Only clusters with an energy higher than this will be recorded.
  
  'upper_energy_limit': This is the upper energy limit. Only clusters with an energy lower than this will be recorded.
exclude_recording_cluster_screened_by_diversity_scheme (bool.): Once a cluster is created by the genetic algorithm, it will be put through a Diversity Scheme. After this, it may be deleted as it is found to be too simiar to another cluster (based on the criteria of that Diversity Scheme). This setting will determine weither to still record the cluster, even if it removed by the diversity scheme, or not to bother. If True, then we do not bother with recording it. If False, we will still consider recording if, even if it is removed by the Diversity Scheme. Default: True
limit_number_of_clusters_recorded (int): This is the number of clusters that the user would like to limit themselves to recording. The clusters recorded will be the lowest energetic clusters obtained from that genetic algorithm run. This can be useful as to prevent GBs of data from being written, that might not necessary be useful. For example, if limit_number_of_clusters_recorded = 50, the 50 lowest energetic clusters obtained from the genetic algorithm will be recorded. Default: None
limit_size_of_database (str.): This is the maximum size that the database can get to in KB, MB, GB or TB. For example, enter in ‘250MB’ if you want the maximum size to be 250 MB. If the database gets bigger than this, the higher energetic clusters in the database is deleted.
saving_points_of_GA ([int,…]): This parameter is not so much for recording individual clusters, but to save the state of the genetic algorithm after a certain number of generations. This can be useful if the user would like to benchmarking the genetic algorithm in some way, by using the data from the genetic algorithm after some numbers of generations has been performed. The data that will be recorded are the population, the PoolProfile, the EnergyProfile, and the Recorded_Clusters folders. This variable is a list of integers, where each entry in the list tell the algorithm to record all the data of the genetic algorithm after that number of genetations. For example, saving_points_of_GA = [2, 5, 10, 23] indicates the genetic algorithm should record the all the data once 2, 5, 10 and 23 generations have occurred. Default: None (i.e. do not bother recording anything.)
record_initial_population (bool.): If True (set if not specified) -> Create a folder called “Initial_Population” which contains the folders (including structural files) of the initial population). If False -> Do not do this. Default: True.
show_GA_Recording_Database_check_percentage (bool.): if True: This will show a loading bar when the GA_Recording_Database is being checked to make sure that all cluster in the database were made before the current generation. This can sometimes take a while so this progress bar is to show that something is happening. If False, no process bar will be shown, but the GA_Recording_Database checking procedure will still run. Set to True if you are debugging the GA_Recording_Database component of this program. Set to False if you are running this through slurm or another job scheduling manager as this can fill up the stderr file and make it very large, hard to read and hard to open (because it is so large). Default: False.

TIME/EFFICIENCY ISSUES TO NOTE IF YOU USE ga_recording_scheme = Limit_energy_height OR SET limit_number_of_clusters_recorded TO SOME VALUE OR SET limit_size_of_database TO SOME VALUE.: If you choose to use the 'Limit_energy_height' recording scheme and/or if you set limit_number_of_clusters_recorded to some value and/or set limit_size_of_database to some value, you may find that your Genetic algorithm slows down. This may occur especially if you set limit_number_of_clusters_recorded to some value and/or set limit_size_of_database to some value. Here, clusters can be removed from the cluster recording database. If the database gets quite large, this can take some time to do and can being to be prohibitive. Ultimately, if possible you should avoid using these settings to prevent any time issues that could arise. It is best to use ga_recording_information['ga_recording_scheme'] = 'None' or 'All' or 'Set_higher_limit' or 'Set_energy_limits', and do not set ga_recording_information['limit_number_of_clusters_recorded'] and ga_recording_information['limit_size_of_database'] (or set ga_recording_information['limit_number_of_clusters_recorded'] = None and ga_recording_information['limit_size_of_database'] = None).

An example of how to enter these parameters into 'ga_recording_information' for your Run.py or MakeTrials.py file is shown below:

ga_recording_information = {}
ga_recording_information['ga_recording_scheme'] = 'Limit_energy_height' # float('inf')
ga_recording_information['limit_number_of_clusters_recorded'] = 70 # float('inf')
ga_recording_information['limit_energy_height_of_clusters_recorded'] = 1.5 #eV
ga_recording_information['exclude_recording_cluster_screened_by_diversity_scheme'] = True
ga_recording_information['saving_points_of_GA'] = [50, 100, 150, 200]
ga_recording_information['record_initial_population'] = True
ga_recording_information['show_GA_Recording_Database_check_percentage'] = False

Note: If ga_recording_information = {} is all that is entered, the instance of Recording_Cluster will not record any clusters. The clusters that are recorded will be the lowest energetic structures that can be recorded. This will not take the fitness of the cluster into account, only the energy of the cluster.

Note that the Recorded_Data/GA_Recording_Database.db may become very big and hard to process on your computer with ase db. There is a program called Postprocessing_Database.py that is designed to break up the GA_Recording_Database.db database into smaller chunks if needed. See *Postprocessing_Database.py* - For breaking a large database into smaller chunks for more information on how to use this program.

See Using Databases with the Genetic Algorithm for more information about how to use databases in ASE.