My JGAP page - Genetic programming and Symbolic Regression

JGAP (Java Genetic Algorithms Package) is a Java based system for genetic algorithms and genetic programming.

For more about JGAP:

Symbolic Regression

2010-02-21: I blogged about this package in Symbolic regression (using genetic programming) with JGAP.

Download all Java and configuration files mentioned below: symbolic_regression.zip.

Here is the result of my experiment with Symbolic Regression using Genetic Programming in JGAP. Right now I'm learning both JGAP and genetic programming so there are, for sure, peculiarities in the files. After I learn more, new features will be added or old will be removed.

SymbolicRegression.java is the main program. As you may recognize, it is based on JGAP's example MathProblem.java but extended with some bells & whistles.

The program is compiled with (on a Linux box) like this:

javac -Xlint:unchecked -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression.java

and run with:

java -server -Xmx1024m -Xss2M  -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression [config file]

For compiling, all the Java files below must be downloaded and placed in the same directory as the the file. Here is my log4j.properties file.

See below for more about the configuration files.

Note: Most of these files where incorporated (in some case with changes by Klaus Meffert) in the JGAP distribution, version 3.6 (directory examples/src/examples/gp/symbolicRegression).

Defined functions

Here are my defined functions. Some of these may be considered experimental, but may be of some use. Please note that the code is not tidied up etc.

Boolean operators for DoubleClass
Here are the boolean operators for use with DoubleClass, i.e. they has double as input and returns a double (0.0d or 1.0d). Some of these functions are tested in odd_parity.conf.
- AndD.java: And
- DifferentD.java: Different
- EqualsD.java: Equals
- GreaterThanD.java: GreaterThan
- GreaterThanOrEqualD.java: GreaterThanOrEqual
- IfElseD.java: IfElse
- IfLessThanOrEqualD.java: If Less Than Or Equal Then .. Else (if a < b then c else d). Inspired by Koza's function IFLTE
- IfLessThanOrEqualZeroD.java: If Less Than Or Equal 0 Then .. Else (if a < 0 then b else c). Inspired by Koza's function IFLTZ
- LesserThanD.java: LesserThan
- LesserThanOrEqualD.java: LesserThanOrEqual
- NotD.java: Not
- OrD.java: Or
- XorD.java: Xor
ModuloD.java: Modulo with double as input and output. First the input is converted to integers and then an integer modulo is done which is returned as a double. (The standard % operator on double is not what I wanted.) This is tested in isbn_test.conf.
ModuloReplaceD.java: Sometimes we want the Modulo function not to return 0 but some other value (e.g. the highest possible values in the data set). Then this function may be tried. Note: The replacement value is manually set in the configuration option mod_replace. This should be considered highly experimental.
DivideIntD.java: A protected variant of Divide where the division is done by first converting to Integer then doing an integer division. Also, if the divisor is 0 (zero), the result is 1 (i.e. protected).
DivideProtected.java: A protected variant of Divide the result is 1 (i.e. protected) if the divisor is 0 (zero), else standard double division.
Mathematical functions.
- Cube.java: Cube (x^3)
- Gamma.java: Gamma
- Gaussian.java: Gaussian
- Hill.java: Hill
- Logistic.java: Logistic
- RoundD.java: RoundD, my version of round()
- Sigmoid.java: Sigmoid
- Sign.java: Sign
- Sqrt.java: Sqrt
- Square.java: Square (x^2)
- Step.java: Step
Other functions
- Id.java: Identity function
- LoopD.java: Loop for double. Highly experimental.
- ForLoopD.java: ForLoop for double.

Configuration files

One of the primary tasks was to be able to use a configuration file to state the problem and the data. Below are some examples. Please note that some of these are experimental (and use experimental parameters/operators), and also they may not give any interesting or good results. More info about the data/problem is usually in the header of the file.

Some of the problems was first tested with Eureqa and was commented in Eureqa: Equation discovery with genetic programming (a Google Translation of my original swedish blog post Eureqa: equation discovery med genetisk programmering).

alldifferent3.conf: All variables should be different
bolts.conf: Bolts. A machine learning example
boyles_law.conf: Boyle's law.
catalan.conf: Catalan numbers
circle_1.conf : Circle
duelling_cowboys.conf : Duelling cowboys
equation.conf : The infamous 11x11=4, 22x22=16, 33x33=? puzzle.
equation3.conf : Sequence puzzle
equation4.conf : Sequence puzzle
exp_formula.conf: Test of Exp function
exp_formula_no_exp.conf: Test of Exp function, but without Exp in the function list.
facebook_puzzle.conf: Facebook puzzle (1 + 4 = 5, 2 + 5 = 12, 3 + 6 = 21, 5 + 8 = ??)
facebook_puzzle_b.conf: Facebook puzzle (1 + 4 = 5, 2 + 5 = 12, 3 + 6 = 21, 5 + 8 = ??), alternative encoding
facebook_puzzle2.conf: Another Facebook puzzle (9 = 90, 8 = 72, 7 = 56, 6 = 42, 3 = ?)
facebook_puzzle2_b.conf: Another Facebook puzzle (9 = 90, 8 = 72, 7 = 56, 6 = 42, 3 = ?), alternative encoding
fahrenheit_celsius.conf: Fahrenheit to Celsius conversion. You may experiment by changing output_variable to 0 for the reverse conversion (C -> F).
fib1.conf: Fibonacci series as a time serie.
fib2.conf: Fibonacci series as a time serie.
fib_50.conf: Fibonacci numbers, where I try to find the closed formula for the Fibonacci number.
func1.conf: Unknown function (from a homework in a course in Evolutionary Computing)
galilei_acceleration.conf: Acceleration (BACON, via bacon-logtalk)
gamma_test.conf: Test of Gammma function
gelman.conf: Linear regression
henon_100.conf: Henon, 100 data points
heron_formula.conf: Heron formula
intro_page_262.conf: A simple problem
iris.conf: Iris data set
isbn_test.conf: Trying to get the program to calculate the checksum for ISBN13
leap_years.conf: Leap years
logit_formula.conf: Test of the Logistic function
longley.conf: Longley's data set of number employments
majority_on_3.conf: Boolean 3-majority on
mod_test.conf: Test of modulus operator.
moons.conf: Moons data
multiplexer_3.conf: 3-multiplexer (i.e. IfThenElse), using Boolean operatrs
multiplexer_3_b.conf: 3-multiplexer (i.e. IfThenElse), using if..then..else
multiplexer_6.conf: 6-multiplexer, using if..then..else
multiplexer_11.conf: 11-multiplexer
mysterious.conf: Mysterious function
number_puzzle1.conf: Number puzzle from Roger Alsing's blog post Genetic Programming: Code smarter than you!
number_puzzle4.conf: Number puzzle inspired by Richard Wiseman's It's the Friday Puzzle (2010-02-26) of finding the result 24 from the numbers 5,5,5,1 and the operators +,-,*,/. Note: the requirement that the numbers should be used exactly once is not held here. This uses only one fitness case (and the option no_terminals).
odd_parity.conf: Odd parity, using the double variants of the boolean functions, i.e. AndD, OrD, NotD (see above)
odd_parity_double.conf: Odd parity, using the arithmetic functions +,-,*,/
odd_parity2.conf: Odd parity for two inputs
p10.conf: Polynom P(10)
p4.conf: Polynom P(4)
p4_2.conf: Polynom P(4)
p4_jgap.conf: Polynom P(4). This is the version in the JGAP example MathFormula.java
p6_2.conf: Polynom P(6)
pickover_puzzle1.conf: Sequence puzzle (via Cliff Pickover)
planets.conf: Planets, i.e. Kepler's third law.
puzzle3.conf: Sequence puzzle
quintic.conf: Quintic polynomial
regression_koza.conf: Regression (0.5 * x^2, from John R. Koza's Lisp implementation)
regression_psh.conf: Regression (y = 12x^2 + 5, from Psh)
seq_ind1.conf: Sequence induction problem: 5*j^4+4*j^3+3*j^2+2^j+1 (for integers 0..10)
sequence_induction.conf: Sequence induction
sequence_puzzle.conf: Sequence puzzle
sequence_puzzle_recursion.conf: Sequence puzzles (e.g. Fibonacci)
sigmoid_test.conf: Test of Sigmoid function
sin_formula.conf: Test of Sine
sin_formula_rand20.conf: Test of Sine
sine_tiny_gp.conf: Test of Sine from TinyGP
sorted_3.conf: Sorting 3 variables.
sqrt_formula2.conf: Yet another test of Sine
sqrt_formula3.conf: Yet another test of Sine
sqrt_formula.conf: Test of Sqrt function
sunspots.conf: Sunspots data as time series
sunspots_timeseries.conf: Two version of sunspots data using make_time_series
test1.conf: A test of many functions.
test2.conf: A test of new functions.
tic_tac_toe.conf: Tic-tac-toe
timeseries_test1.conf: Time series with make_time_series
timeseries_dailyisbn.conf: Time series using the classic time series "Daily closing price of IBM stock, Jan 1, 1980 to Oct. 8, 1992" (DAILYIBM.DAT from TSDL)
triangular_numbers.conf: Triangular numbers
weather.conf: Weather (classic classification example)
x2.conf: A simply polynomial: x^2
x4-x3+x2-x.conf: Polynomial: x^4-x^3+x^2-x
zoo2.conf: Zoo (classic classification example)

The configuration parameters

The configuration file consists of the following parameters. Here is a short explanation; the full story is in the code: SymbolicRegression.java. Most of the parameters has reasonable default values, taken from either MathProblem.java or GPConfiguration.

#, %: Line comments; lines that start with the characters "#" or "%" will be ignored.
presentation: A text which is shown first in the run.
num_input_variables: Number of input variables in the data set.
output_variable: The index (0-based) of the output variable. Default is the last variable.
variable_names: The name of the variables, in order. Default is "V0", "V1", etc
data: Starts the data section, where each row is presented per line. The attributes may be separated by "," or some space. Decimal point is a . (dot).
If a data row contains a ? (question mark) in the position of the output variable, then it is considered a "user defined test" and the fittest program will be tested against this data last in the run.
terminal_range: The range for the Terminal as lower upper. Note: Only one Terminal is used.
terminal_wholenumbers: If the Terminal should use wholenumbers or not (boolean)
constant: Define a Constant with this value
functions: Define the functions, with the same name as in JGAP (or own defined functions).
adf_arity: If > 0 then ADF is used. This is somewhat experimental as I am still try to understand how ADF:s works.
adf_function: The functions used for ADF.
adf_type: Either double or boolean. If set to boolean, we can use the boolean and logical operators.
max_init_depth: JGAP parameter maxInitDepth
min_init_depth: JGAP parameter minInitDepth
program_creation_max_tries: JGAP parameter programCreationMaxTries
population_size: JGAP parameter populationSize
max_crossover_depth: JGAP parameter maxCrossoverDepth
function_prob: JGAP parameter functionProb
reproduction_prob: JGAP parameter reproductionProb
mutation_prob: JGAP parameter mutationProb
crossover_prob: JGAP parameter crossoverProb
dynamize_arity_prob: JGAP parameter dynamizeArityProb
no_command_gene_cloning: JGAP parameter no_command_gene_cloning
use_program_cache: JGAP parameter use_program_cache
new_chroms_percent: JGAP parameter newChromsPercent
num_evolutions: JGAP parameter numEvolution
tournament_selector_size: JGAP parameter tournamentSelectorSize
max_nodes: JGAP parameter maxNodes
scale_error: Sometimes the data values are very small which gives small fitness values (i.e. errors), making it hard to get any progress. Setting this parameter will multiply the errors by this value.
stop_criteria_fitness: If set (>= 0) then the program will run "forever" (instead of num_evolution) until fitness is less or equal to the value.
show_population: This shows the whole population in each generation. Mainly for debugging purposes.
show_similiar: Shows all the solutions (programs) with the same fitness value as the best solution. Alternative name: show_similar.
similiar_sort_method: Method of sorting the similiar solutions when using show_similiar. Alternative name: similar_sort_method. Valid options:
- occurrence: descending number of occurrences (default)
- length: ascending length of solutions
show_progression: boolean. If true then the generation number is shown for all generations when nothing is happening (i.e. no gain in fitness).
sample_pct: (float) Takes a (sample) percentage of the data set if > 0.0.
validation_pct: Withheld a percentage of the test cases for a validation set. This fitness of this validation set is shown.
show_all_generations: Show info of all generations, not just when fitness is changed.
hits_criteria: Criteria of a hit: if the difference is <= this value, it is considered a hit. The number of non-hits is then used as a fitness measure instead of the sum of errors. Setting this function also shows the number of programs which is <= this value.
mod_replace: Setting the replacement value of 0 (zero) for the ModuloIntD function (see above).
showResults: boolean. If set then all the fitness cases is shown with the output of the fitted program, with difference to the correct values.
resultPrecision: the precision in the output used in showResult, default 5
error_method: Error method to use. Valid options are
- totalError: sum of (absolute) errors (default)
- minError: minimum error
- meanError: mean error
- medianError: median error
- maxError: max error
no_terminals: If true then no Terminal is used, i.e. no numbers, just variables. Default false.
make_time_series: Make a time series of the first line of data. The value of num_input_variable determines the number of laps (+1 for the output variable)
make_time_series_with_index: As make_time_series with an extra input variable for the index of the series. (Somewhat experimental.)
minNodes: value penalty: minimum number of nodes (terminals + functions). If the number of nodes in a program is less than value then a penalty of penalty is added.
alldifferent_variables: true/false penalty: all the variables (terminals) should be different. If there is more than one occurrence of an variable in a program then a penalty of penalty is added (for each extra variable).
ignore_variables: (TBW) It would be nice to be able to ignore some variables in the data set. But this is yet to be written.
return_type: (TWB) This should be the type of the "main" return value. Note: it is now hard coded in the program as double/DoubleClass.

Supported function

The program has support for the following functions from JGAP. The "main" type is double so all functions are not applicable there (e.g. IfElse etc). However, for the ADF functions (defined by setting adf_arity to > 0) many more functions is supported. Please note that some of these are (very) experimental and maybe don't even make sense in this context.

Multiply (double)
Multiply3 (double)
Add (double)
Add3 (double)
Add4 (double)
Divide (double)
Subtract (double)
Sine (double)
ArcSine (double)
Tangent (double)
ArcTangent (double)
Cosine (double)
ArcCosine (double)
Exp (double)
Log (double)
Abs (double)
Pow (double)
Round (double)
Ceil (double)
Floor (double)
Modulo (double), implements Java's % operator for double. See ModuloD for a variant
Max (double)
Min (double)
LesserThan (boolean)
GreaterThan (boolean)
If (boolean)
IfElse (boolean)
IfDyn (boolean)
Loop (boolean)
Equals (boolean)
ForXLoop (boolean)
ForLoop (boolean) (cf the double variant ForLoopD)
Increment (boolean)
Pop (boolean)
Push (boolean)
And (boolean), cf the double variant AndD
Or (boolean), cf the double variant OrD
Xor (boolean), cf the double variant XorD
Not (boolean), cf the double variant NotD
SubProgram (boolean)
Tupel (boolean)

Also, see my own defined functions defined in the Java files above.

Examples

Here are two small examples of the program, including the configuration file and a sample run.

Polynom
Here is a simple example of a configuration file. It happens to be the same problem as the JGAP example MathProblem, the polynom x^4 + x^3 + x^2 - x.

#
# Polynom x^4 + x^3 + x^2 - x
# The JGAP example
#
presentation: P(4) x^4 + x^3 + x^2 - x (the JGAP example)
num_input_variables: 1
variable_names: x y
functions: Add,Subtract,Multiply,Divide,Pow,Log,Sine
terminal_range: -10 10
max_init_depth: 4
population_size: 1000
max_crossover_depth: 8
num_evolutions: 800
max_nodes: 20
stop_criteria_fitness: 0.1
data
-2.378099   26.567495
4.153756   382.45743
2.6789956   75.23481
5.336802   986.33777
2.4132318   51.379707
-1.7993588   9.693933
3.9202332   307.8775
2.9227705   103.56364
-0.1422224   0.159982
4.9111285   719.39545
1.2542424   4.76668
1.5987749   11.577456
4.7125554   615.356
-1.1101999   2.493538
-1.7379236   8.631802
3.8303614   282.29697
5.158349   866.7222
3.6650343   239.42934
0.3196721   -0.17437163
-2.3650131   26.014963

A simple run, slightly edited.

It was 20 data rows
Presentation: P(4) x^4 + x^3 + x^2 - x (the JGAP example)
output_variable: y (index: 1)
input variable: x
function1: &1 + &2
function1: &1 - &2
function1: &1 * &2
function1: /
function1: &1 ^ &2
function1: log &1
function1: sine &1
function1: 1.0
[19:52:57] INFO  GPGenotype - Creating initial population
[19:52:57] INFO  GPGenotype - Mem free: 10.5 MB
[19:52:57] INFO  GPPopulation - Prototype program set
[19:52:57] INFO  GPGenotype - Mem free after creating population: 10.5 MB
Creating initial population
Mem free: 10.5 MB
Evolving generation 1/800, memory free: 6.7 MB (time from start:  0,42s)
Best solution fitness: 968.56
Best solution: x ^ 4.0
Depth of chrom: 1
Correlation coefficient: 0.9995009838030151
Evolving generation 4/800, memory free: 11.4 MB (time from start:  0,84s)
Best solution fitness: 813.62
Best solution: ((4.0 + 4.0) + (log 4.0)) * ((x ^ 3.0) / (9.0 / x))
Depth of chrom: 3
Correlation coefficient: 0.999500983803015
Evolving generation 6/800, memory free: 6.7 MB (time from start:  1,07s)
Best solution fitness: 712.97
Best solution: ((5.0 * x) + ((x ^ 4.0) + x)) + x
Depth of chrom: 4
Correlation coefficient: 0.999781550858714
Evolving generation 7/800, memory free: 40.3 MB (time from start:  1,20s)
Best solution fitness: 582.77
Best solution: ((9.0 * x) + (x * 9.0)) - ((9.0 - x) - (x ^ 4.0))
Depth of chrom: 3
Correlation coefficient: 0.9965296891627895
Evolving generation 8/800, memory free: 24.9 MB (time from start:  1,32s)
Best solution fitness: 471.58
Best solution: (((x + 7.0) * (x * x)) * x) * (x / 9.0)
Depth of chrom: 4
Correlation coefficient: 0.9980245718601822
Evolving generation 11/800, memory free: 29.6 MB (time from start:  1,69s)
Best solution fitness: 364.73
Best solution: ((x * 8.0) + ((x ^ 4.0) + x)) - (4.0 - ((x * x) * x))
Depth of chrom: 4
Correlation coefficient: 0.9988207761993971
Evolving generation 16/800, memory free: 33.2 MB (time from start:  2,24s)
Best solution fitness: 317.84
Best solution: ((x * ((x * x) + (log 9.0))) * x) + (x * 9.0)
Depth of chrom: 5
Correlation coefficient: 0.9993672319814505
Evolving generation 17/800, memory free: 19.4 MB (time from start:  2,38s)
Best solution fitness: 169.76
Best solution: (x ^ 4.0) + (x * (x * x))
Depth of chrom: 3
Correlation coefficient: 0.9999752402614274
Evolving generation 22/800, memory free: 24.2 MB (time from start:  3,10s)
Best solution fitness: 136.21
Best solution: ((x * x) * x) + (x + (x * (x * (x * x))))
Depth of chrom: 5
Correlation coefficient: 0.9999485732269622
Evolving generation 23/800, memory free: 10.4 MB (time from start:  3,20s)
Best solution fitness: 3.7509724195622374E-4
Best solution: (x * ((((x * x) + x) * x) + x)) - x
Depth of chrom: 6
Correlation coefficient: 0.9999999999999939

Fitness stopping criteria (0.1) reached with fitness 3.7509724195622374E-4 at generation 23

All time best (from generation 23)
Evolving generation 23/800, memory free: 10.4 MB (time from start:  3,20s)
Best solution fitness: 3.7509724195622374E-4
Best solution: (x * ((((x * x) + x) * x) + x)) - x
Depth of chrom: 6
Correlation coefficient: 0.9999999999999939

Total time  3,20s

Fibonacci series
Here is another example, Fibonacci series as time serie. The object is to give a symbolic regression on the fourth value (F4), which is solved quite fast.

#
# Fibonacci with 3 variables
# 
presentation: This is the Fibonacci series
return_type: DoubleClass
num_input_variables: 3
variable_names: F1 F2 F3 F4
functions: Multiply,Divide,Add,Subtract
terminal_range: -10 10
max_init_depth: 4
population_size: 20
max_crossover_depth: 8
num_evolutions: 100
max_nodes: 21
show_similiar: true
show_population: false
stop_criteria_fitness: 0
data
1,1,2,3
1,2,3,5
2,3,5,8
3,5,8,13
5,8,13,21
8,13,21,34
13,21,34,55
21,34,55,89
34,55,89,144
55,89,144,233
89,144,233,377
144,233,377,610
233,377,610,987
377,610,987,1597
610,987,1597,2584
987,1597,2584,4181
1597,2584,4181,6765
2584,4181,6765,10946
4181,6765,10946,17711
6765,10946,17711,28657
10946,17711,28657,46368

And a sample run:

Presentation: This is the Fibonacci series
output_variable: F4 (index: 3)
input variable: F1
input variable: F2
input variable: F3
function1: &1 * &2
function1: /
function1: &1 + &2
function1: &1 - &2
function1: -5.0
[19:56:31] INFO  GPGenotype - Creating initial population
[19:56:31] INFO  GPGenotype - Mem free: 7.5 MB
[19:56:31] INFO  GPPopulation - Prototype program set
[19:56:31] INFO  GPGenotype - Mem free after creating population: 7.5 MB
Creating initial population
Mem free: 7.5 MB
Evolving generation 1/100, memory free: 6.7 MB (time from start:  0,03s)
Best solution fitness: 17649.74
Best solution: (F3 - (-6.0 - F1)) + ((F2 - F2) - (-6.0 / F1))
Depth of chrom: 3
Correlation coefficient: 0.9999999838289084
Evolving generation 6/100, memory free: 6.0 MB (time from start:  0,13s)
Best solution fitness: 147.0
Best solution: 7.0 + (F2 + F3)
Depth of chrom: 2
Correlation coefficient: 1.0
Evolving generation 7/100, memory free: 5.8 MB (time from start:  0,14s)
Best solution fitness: 0.0
Best solution: F2 + F3
Depth of chrom: 1
Correlation coefficient: 1.0

Fitness stopping criteria (0.0) reached with fitness 0.0 at generation 7

All time best (from generation 7)
Evolving generation 7/100, memory free: 5.8 MB (time from start:  0,15s)
Best solution fitness: 0.0
Best solution: F2 + F3
Depth of chrom: 1
Correlation coefficient: 1.0

Total time  0,15s

All solutions with the best fitness (0.0):
F2 + F3 (1)

Todo

Here are some TODO:s, or things nice to have.

option for ignoring specific variables
option for stopping:
- running forever
- after a specific time,
- ~~when a specific fitness value is reached~~ Fixed (by stop_criteria_fitness)
~~calculate the number of data rows automatically (i.e. skip num_row)~~ Fixed.
accept nominal values in the data section; then converted to numeric values.
~~show some progress output.~~ Fixed (by show_progress)
~~make the output from this program instead from log4j.~~ Fixed.
~~for larger data sets it would be nice with an option to use just a sample (percentage) of the full data.~~ Fixed (by sample_pct)
~~Withheld a certain percentage for validation of the best fitted program.~~ Fixed (by validation_pct)
~~show number of programs less or equal a hits critera~~ Fixed (by hits_criteria)
~~show info for all generations~~ Fixed (by show_all_generations)
~~support for user defined test instances~~ Fixed (by ? in a data row)
~~show number of functions / terminals of the best program~~ Fixed.
add more fitness metrics.
better handling of punishing longer solutions (parsimony pressure).
support for different "main" return classes, i.e. not just DoubleClass
more statistical measures, e.g. R-squared, mean squared error, mean absolut error, minimum error, maximum error
more/better error checks
more building blocks, a la Eureqa http://ccsl.mae.cornell.edu/eureqa_ops
support for derivattives (a la Eureqa)?
incorporate in Weka?
simplify the best solution with a CAS?

Mention/Usage

Here are some mentions or usage of my Symbolic Regression program:

jHepWork online tutorial: Symbolic regression
code.google.com: regresion-simbolica

This page was created by Hakan Kjellerstrand (hakank@bonetmail.com), homepage.