EASEA tutorial


Tutorial

This is a hands-on tutorial explaining how to use EASEA. It uses simple examples that show both how evolutionary algorithms work and how the easena compiler can be used to achieve the expected results.

1) The onemax problem

Please find here a most basic onemax.ez code (also found in the examples directory of the EASEA distribution) implementing the onemax problem (starting from a population of individuals made of totally random arrays of 1's and 0's, evolve an individual made only of 1's). Contrary to what is said here and there on the internet, the onemax problem is quite difficult for artificial evolution, because the default evaluation function (the sum of the bits composing an individual) is not very informative about where the good genes are located and therefore cannot correctly guide the crossover function.

To run this code, copy the contents of the onemax.ez code into a file called "onemax.ez" and follow the instructions below:

  1. Compile using the following command line:
    $ easena onemax.ez
    If you look at the contents of your directory, you will find that easena has created many other C++ source files out of your onemax.ez file. Because easena is nice to its users, it has also created a Makefile, so that you can now compile the created source files with:
  2. $ make
    You can then run the onemax executable file with:
  3. $ ./onemax

Please note that because artificial evolution is a stochastic optimization algorithm, running the same algorithm several times will yield different results... This means that no repeatability is guaranteed (you may find a wonderful solution one day and never find it again). This also means that bug chasing can be difficult if your code only hangs once in a while.

Fortunately, random numbers are created by a deterministic random number generator that uses a seed number as a starting point, meaning that stochastic algorithms can be repeated exactly if you provide them with the exact same seed value.
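This is easy to check with any seeded generator. Here is a quick sketch using bash's built-in RANDOM variable (a stand-in for EASEA's own generator, used only to illustrate the principle):

```shell
# Assigning a value to bash's RANDOM variable seeds its pseudo-random
# generator: the same seed always restarts the same sequence.
RANDOM=42; a="$RANDOM $RANDOM $RANDOM"
RANDOM=42; b="$RANDOM $RANDOM $RANDOM"
[ "$a" = "$b" ] && echo "same seed, same sequence"
```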

The EASEA compiler will create an executable file that accepts many command line options, that you can list by typing:
$ ./onemax --help
One of these options is called --seed. Without the --seed option, EASEA will use a different seed value every time the code is run, but if you fix the seed, you will obtain reproducible runs.

Because this is also a tutorial on artificial evolution, it is important for you to know that since different seed values yield different results, good practice when evaluating the quality of a stochastic algorithm is to run it at least 30 times (each time with a different seed) so as to "cancel out" the effect of randomness, as the law of large numbers kicks in after around 30 trials. In the examples below, we will always test the code with seed 42, to guarantee reproducibility.
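Such a 30-run campaign can be scripted in a few lines of shell. Here, onemax is a hypothetical stand-in function that echoes a fixed fitness so the sketch is self-contained; with the real binary, simply drop the function definition:

```shell
# Run the optimizer 30 times, each time with a different seed, and
# collect one result line per run in results.txt.
onemax() { echo "seed $2 -> best fitness 7.18e+02"; }  # hypothetical stand-in
for seed in $(seq 1 30); do
  onemax --seed "$seed"
done > results.txt
wc -l < results.txt
```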

Because of the variability of the results coming out of a stochastic algorithm, and because seed values make this variability controllable, it is necessary to take some inspiration from biologists and copy their good practice of keeping a laboratory notebook, where all parameters and seeds are written down so that good parameter sets can be found again.

If you execute $ ./onemax --seed 42 several times, you will always find the value 7.18e+02 (=718).

Now let's have a look at the code that produced this.

Section \User declarations :

After a header describing the file, the first "paragraph" is the following:

\User declarations : // section inserted at the very beginning of the object source code
#define SIZE 1000    // meaning that you can use macros or define global variables
float fPROB_MUT_PER_GENE=1/(float)SIZE;
\end
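Note what fPROB_MUT_PER_GENE encodes: with SIZE=1000 genes, each mutating with probability 1/SIZE, the expected number of mutated genes per individual is exactly one. A one-line sanity check of that arithmetic:

```shell
# Expected number of mutations per individual = SIZE * (1/SIZE) = 1.
awk 'BEGIN { size = 1000; p = 1 / size; printf "%g\n", size * p }'
```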

EASEA source code works with "sections", which are processed by the EASEA compiler to produce clean and human-readable C++ source code that must then be compiled by a standard C++ compiler, as seen above. Please open the created .cpp and .hpp files to understand what EASEA is doing.

The first section we saw above is called \User declarations :.

Sections must be ended with \end. Section names are designed to be self-explanatory: the contents of the \User declarations : section are therefore placed above all the C++ code created by EASEA out of the onemax.ez source file.

Unused sections

Because the onemax problem is very simple, many optional sections are not used. We will go through them to explain their role (please note how their names have been chosen to be explicit):

\User functions: \end

\User CUDA: \end

\Before everything else function: \end

\After everything else function: \end

\At the beginning of each generation function: \end

\At the end of each generation function: \end

\At each generation before reduce function: \end




2) Faux-Fourier transform

The idea of the onemax tutorial was for you to get a really simple example of an evolutionary algorithm running, and have fun tuning the different parameters to see their effect. Let's get more serious now, by trying to find out how a periodic signal decomposes into a sum of sines.

As always in computer science, the best way to start a new project is to start from a previous working project, so let's start from the previous "optimised" good-onemax.ez, by copying it to faux_fourier.ez ($ cp onemax.ez faux_fourier.ez). Beware: using "faux-fourier.ez" (with a hyphen) will get you a compilation error, because EASEA derives identifiers in the generated C++ code from the file name, and a hyphen is not a valid character in a C++ identifier.

Reproducibility

Running evolutionary algorithms is a discipline that is very close to biology laboratory work: if you use a different random seed each time you launch an experiment, you will get different results every time.

Therefore, many tools used to manage wet labs can be a great inspiration when running evolutionary algorithms. Let me quote Rules 5, 6 and 7 from Santiago Schnell's excellent "Ten Simple Rules for a Computational Biologist’s Laboratory Notebook":

Schnell S (2015) Ten Simple Rules for a Computational Biologist’s Laboratory Notebook. PLoS Comput Biol 11(9): e1004385. doi:10.1371/journal.pcbi.1004385

Rule 5: Every Entry Should Be Recorded with a Date, Subject, and Protocol
The most logical organization of a lab notebook is chronological. Each entry should contain the date it was made and subject of the entry. If you make distinct sets of entries in the same day, you should separate them by using heading titles and leave sufficient space between the entries [1]. The titles of your entries are important. They should be short, sharp, and informative, as you will use them to build a table of contents for your lab notebook. If you are using a paper lab notebook, you will need to write the date and time stamp and make your entries legible, written in permanent ink and in a language accessible to everyone in the laboratory. If you use an electronic lab notebook, the date and time stamps will be entered automatically for each new subject.
You should also include a brief background for each entry [2]. It could be a short synopsis of your thought process that explains why this entry is important for your scientific work. You can support it with references published in the literature, ideas learned from a research talk, or a previous model developed by your laboratory. Then, record everything you do for this specific entry. If you make a mistake in your paper lab notebook, put a thin line through the mistake and write the new information next to it [1]. Never erase or destroy an entry, because errors or mistakes are an important part of the scientific process.
Rule 6: Keep a Record of How Every Result Was Produced
The gold standard of science is reproducibility [7]. You need to keep a record of how every result was produced in your in silico experiments, statistical analyses, and mathematical or computational models. Noting the sequence of steps taken allows for a result or analysis to be reproduced. For every step in your model, analysis, or experiment, you should record every detail that will influence its execution [8]. This includes the preparation of your wet or dry experiment, preprocessing, execution with intermediate steps, analysis, and postprocessing of results [1,2]. You should also store the raw data for every figure. This will allow you to have the exact values for the visualization of your results. It will also give you the opportunity to redraw figures to improve their quality or ensure visual consistency for publication. If a result is obtained through a mathematical model, algorithm, or computer program, you need to include the equations and name and version of the algorithm or computer program, as well as the initial conditions and parameters used in the model. In many instances, a statistical or computational analysis creates intermediate data [9]. You should record intermediate results because they can reveal discrepancies with your final results, particularly if your analysis is time consuming or readily executable. At the same time, they can help you track any inconsistencies in your analysis or algorithms [8]. Electronic lab notebooks can be very convenient for storing data and linking to computer mathematical models, algorithms, and software stored in cloud mass storage systems.
Rule 7: Use Version Control for Models, Algorithms, and Computer Code
As a mathematical and computational biologist, you will be updating your models, algorithms, or computer programs frequently. You will also create scripts containing initial conditions and parameters to run analyses or simulations. Changes in your models, algorithms, programs, or scripts could drastically change your results. If you do not systematically archive changes, it will be very difficult or impossible to track the codes that you used to generate certain results [8,9]. Nowadays, there are version control systems to track the evolution of algorithms and computer programs or changes in scripts. Bitbucket, Git, Subversion, and Mercurial are among the most widely used version-control systems. You should use a standardized name system to identify changes. If you have a paper lab notebook, you should record the name and location of the scripts. Those using electronic lab notebooks can add links to each version of their scripts or programs.
References
1. Thomson JA (2007). How to Start—and Keep—a Laboratory Notebook: Policy and Practical Guidelines. In: Intellectual Property Management in Health and Agricultural Innovation: A Handbook of Best Practices (eds. Krattiger A, Mahoney RT, Nelsen L, et al.). MIHR: Oxford, U.K., pp. 763–771.
2. National Institutes of Health, Office of Intramural Training and Education’s Webinar on Keeping a Lab Notebook. https://www.training.nih.gov/OITEtutorials/OITENotebook/Notebook.html
3. Sandefur CI (2014). Blogging for electronic record keeping and collaborative research. Am J Physiol Gastrointest Liver Physiol. 307:G1145–G1146. doi: 10.1152/ajpgi.00384.2014 PMID: 25359540
4. De Polavieja Lab. Open LabBook. http://www.neural-circuits.org/open-labbook
5. Todoroki S, Konishi T, Inoue S (2006). Blog-based research notebook: personal informatics workbench for high-throughput experimentation. Appl Surf Sci 252: 2640–2645.
6. Ruggiero P, Heckathorn MA (2012). Data backup options. Department of Homeland Security, United States Computer Readiness Plan. https://www.us-cert.gov/sites/default/files/publications/data_backup_options.pdf
7. Jasny BR, Chin G, Chong L, Vignieri S (2011). Data replication & reproducibility. Again, and again, and again .... Introduction. Science. 334(6060):1225. doi: 10.1126/science.334.6060.1225 PMID: 22144612
8. Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013). Ten simple rules for reproducible computational research. PLoS Comput Biol. 9(10):e1003285. doi: 10.1371/journal.pcbi.1003285 PMID: 24204232
9. Peng RD (2011). Reproducible research in computational science. Science. 334:1226–7. doi: 10.1126/science.1213847 PMID: 22144613
10. U.S. Department of Health & Human Services, Office of Research Integrity. Notebook and data management. http://ori.hhs.gov/education/products/wsu/data.html
11. U.S. Department of Health & Human Services, Office of Research Integrity. Data Management Responsibilities. http://ori.hhs.gov/education/products/unh_round1/www.unh.edu/rcr/DataMgt-Responsibilities01.htm