My JGAP page - Genetic programming and Symbolic Regression
JGAP (Java Genetic Algorithms Package) is a Java based system for genetic algorithms and genetic programming.
For more about JGAP:
Symbolic Regression
2010-02-21: I blogged about this package in Symbolic regression (using genetic programming) with JGAP.
Download all Java and configuration files mentioned below: symbolic_regression.zip.
Here is the result of my experiment with Symbolic Regression using Genetic Programming in JGAP. Right now I'm learning both JGAP and genetic programming so there are, for sure, peculiarities in the files. After I learn more, new features will be added or old will be removed.
SymbolicRegression.java is the main program. As you may recognize, it is based on JGAP's example MathProblem.java
but extended with some bells & whistles.
The program is compiled with (on a Linux box) like this:
javac -Xlint:unchecked -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression.java
and run with:
java -server -Xmx1024m -Xss2M -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression [config file]
For compiling, all the Java files below must be downloaded and placed in the same directory as the the file. Here is my log4j.properties file.
See below for more about the configuration files.
Note: Most of these files where incorporated (in some case with changes by Klaus Meffert) in the JGAP distribution, version 3.6 (directory examples/src/examples/gp/symbolicRegression
).
Defined functions
Here are my defined functions. Some of these may be considered experimental, but may be of some use. Please note that the code is not tidied up etc.
- Boolean operators for DoubleClass
Here are the boolean operators for use with DoubleClass, i.e. they has double
as input and returns a double
(0.0d or 1.0d). Some of these functions are tested in odd_parity.conf.
- ModuloD.java: Modulo with
double
as input and output. First the input is converted to integers and then an integer modulo is done which is returned as a double. (The standard %
operator on double is not what I wanted.) This is tested in isbn_test.conf.
- ModuloReplaceD.java: Sometimes we want the Modulo function not to return 0 but some other value (e.g. the highest possible values in the data set). Then this function may be tried. Note: The replacement value is manually set in the configuration option
mod_replace
. This should be considered highly experimental.
- DivideIntD.java: A protected variant of
Divide
where the division is done by first converting to Integer
then doing an integer division. Also, if the divisor is 0 (zero), the result is 1 (i.e. protected).
- DivideProtected.java: A protected variant of
Divide
the result is 1 (i.e. protected) if the divisor is 0 (zero), else standard double division.
- Mathematical functions.
- Other functions
Configuration files
One of the primary tasks was to be able to use a configuration file to state the problem and the data. Below are some examples. Please note that some of these are experimental (and use experimental parameters/operators), and also they may not give any interesting or good results. More info about the data/problem is usually in the header of the file.
Some of the problems was first tested with Eureqa and was commented in Eureqa: Equation discovery with genetic programming (a Google Translation of my original swedish blog post Eureqa: equation discovery med genetisk programmering).
- alldifferent3.conf: All variables should be different
- bolts.conf: Bolts. A machine learning example
- boyles_law.conf: Boyle's law.
- catalan.conf: Catalan numbers
- circle_1.conf : Circle
- duelling_cowboys.conf : Duelling cowboys
- equation.conf : The infamous
11x11=4, 22x22=16, 33x33=?
puzzle.
- equation3.conf : Sequence puzzle
- equation4.conf : Sequence puzzle
- exp_formula.conf: Test of Exp function
- exp_formula_no_exp.conf: Test of Exp function, but without
Exp
in the function list.
- facebook_puzzle.conf: Facebook puzzle (
1 + 4 = 5, 2 + 5 = 12, 3 + 6 = 21, 5 + 8 = ??
)
- facebook_puzzle_b.conf: Facebook puzzle (
1 + 4 = 5, 2 + 5 = 12, 3 + 6 = 21, 5 + 8 = ??
), alternative encoding
- facebook_puzzle2.conf: Another Facebook puzzle (
9 = 90, 8 = 72, 7 = 56, 6 = 42, 3 = ?
)
- facebook_puzzle2_b.conf: Another Facebook puzzle (
9 = 90, 8 = 72, 7 = 56, 6 = 42, 3 = ?
), alternative encoding
- fahrenheit_celsius.conf: Fahrenheit to Celsius conversion. You may experiment by changing
output_variable
to 0 for the reverse conversion (C -> F).
- fib1.conf: Fibonacci series as a time serie.
- fib2.conf: Fibonacci series as a time serie.
- fib_50.conf: Fibonacci numbers, where I try to find the closed formula for the Fibonacci number.
- func1.conf: Unknown function (from a homework in a course in Evolutionary Computing)
- galilei_acceleration.conf: Acceleration (BACON, via bacon-logtalk)
- gamma_test.conf: Test of Gammma function
- gelman.conf: Linear regression
- henon_100.conf: Henon, 100 data points
- heron_formula.conf: Heron formula
- intro_page_262.conf: A simple problem
- iris.conf: Iris data set
- isbn_test.conf: Trying to get the program to calculate the checksum for ISBN13
- leap_years.conf: Leap years
- logit_formula.conf: Test of the
Logistic
function
- longley.conf: Longley's data set of number employments
- majority_on_3.conf: Boolean 3-majority on
- mod_test.conf: Test of modulus operator.
- moons.conf: Moons data
- multiplexer_3.conf: 3-multiplexer (i.e. IfThenElse), using Boolean operatrs
- multiplexer_3_b.conf: 3-multiplexer (i.e. IfThenElse), using
if..then..else
- multiplexer_6.conf: 6-multiplexer, using
if..then..else
- multiplexer_11.conf: 11-multiplexer
- mysterious.conf: Mysterious function
- number_puzzle1.conf: Number puzzle from Roger Alsing's blog post Genetic Programming: Code smarter than you!
- number_puzzle4.conf: Number puzzle inspired by Richard Wiseman's It's the Friday Puzzle (2010-02-26) of finding the result 24 from the numbers 5,5,5,1 and the operators +,-,*,/. Note: the requirement that the numbers should be used exactly once is not held here. This uses only one fitness case (and the option
no_terminals
).
- odd_parity.conf: Odd parity, using the double variants of the boolean functions, i.e. AndD, OrD, NotD (see above)
- odd_parity_double.conf: Odd parity, using the arithmetic functions +,-,*,/
- odd_parity2.conf: Odd parity for two inputs
- p10.conf: Polynom P(10)
- p4.conf: Polynom P(4)
- p4_2.conf: Polynom P(4)
- p4_jgap.conf: Polynom P(4). This is the version in the JGAP example MathFormula.java
- p6_2.conf: Polynom P(6)
- pickover_puzzle1.conf: Sequence puzzle (via Cliff Pickover)
- planets.conf: Planets, i.e. Kepler's third law.
- puzzle3.conf: Sequence puzzle
- quintic.conf: Quintic polynomial
- regression_koza.conf: Regression (0.5 * x^2, from John R. Koza's Lisp implementation)
- regression_psh.conf: Regression (y = 12x^2 + 5, from Psh)
- seq_ind1.conf: Sequence induction problem: 5*j^4+4*j^3+3*j^2+2^j+1 (for integers 0..10)
- sequence_induction.conf: Sequence induction
- sequence_puzzle.conf: Sequence puzzle
- sequence_puzzle_recursion.conf: Sequence puzzles (e.g. Fibonacci)
- sigmoid_test.conf: Test of Sigmoid function
- sin_formula.conf: Test of Sine
- sin_formula_rand20.conf: Test of Sine
- sine_tiny_gp.conf: Test of Sine from TinyGP
- sorted_3.conf: Sorting 3 variables.
- sqrt_formula2.conf: Yet another test of Sine
- sqrt_formula3.conf: Yet another test of Sine
- sqrt_formula.conf: Test of Sqrt function
- sunspots.conf: Sunspots data as time series
- sunspots_timeseries.conf: Two version of sunspots data using
make_time_series
- test1.conf: A test of many functions.
- test2.conf: A test of new functions.
- tic_tac_toe.conf: Tic-tac-toe
- timeseries_test1.conf: Time series with
make_time_series
- timeseries_dailyisbn.conf: Time series using the classic time series "Daily closing price of IBM stock, Jan 1, 1980 to Oct. 8, 1992" (DAILYIBM.DAT from TSDL)
- triangular_numbers.conf: Triangular numbers
- weather.conf: Weather (classic classification example)
- x2.conf: A simply polynomial: x^2
- x4-x3+x2-x.conf: Polynomial: x^4-x^3+x^2-x
- zoo2.conf: Zoo (classic classification example)
The configuration parameters
The configuration file consists of the following parameters. Here is a short explanation; the full story is in the code: SymbolicRegression.java. Most of the parameters has reasonable default values, taken from either MathProblem.java or GPConfiguration.
-
#
, %
: Line comments; lines that start with the characters "#" or "%" will be ignored.
-
presentation
: A text which is shown first in the run.
-
num_input_variables
: Number of input variables in the data set.
-
output_variable
: The index (0-based) of the output variable. Default is the last variable.
-
variable_names
: The name of the variables, in order. Default is "V0", "V1", etc
-
data
: Starts the data
section, where each row is presented per line. The attributes may be separated by "," or some space. Decimal point is a .
(dot).
If a data row contains a ?
(question mark) in the position of the output variable, then it is considered a "user defined test" and the fittest program will be tested against this data last in the run.
-
terminal_range
: The range for the Terminal
as lower upper
. Note: Only one Terminal is used.
-
terminal_wholenumbers
: If the Terminal
should use wholenumbers or not (boolean)
-
constant
: Define a Constant
with this value
-
functions
: Define the functions, with the same name as in JGAP (or own defined functions).
-
adf_arity
: If > 0 then ADF is used. This is somewhat experimental as I am still try to understand how ADF:s works.
-
adf_function
: The functions used for ADF.
-
adf_type
: Either double or boolean. If set to boolean, we can use the boolean and logical operators.
-
max_init_depth
: JGAP parameter maxInitDepth
-
min_init_depth
: JGAP parameter minInitDepth
-
program_creation_max_tries
: JGAP parameter programCreationMaxTries
-
population_size
: JGAP parameter populationSize
-
max_crossover_depth
: JGAP parameter maxCrossoverDepth
-
function_prob
: JGAP parameter functionProb
-
reproduction_prob
: JGAP parameter reproductionProb
-
mutation_prob
: JGAP parameter mutationProb
-
crossover_prob
: JGAP parameter crossoverProb
-
dynamize_arity_prob
: JGAP parameter dynamizeArityProb
-
no_command_gene_cloning
: JGAP parameter no_command_gene_cloning
-
use_program_cache
: JGAP parameter use_program_cache
-
new_chroms_percent
: JGAP parameter newChromsPercent
-
num_evolutions
: JGAP parameter numEvolution
-
tournament_selector_size
: JGAP parameter tournamentSelectorSize
-
max_nodes
: JGAP parameter maxNodes
-
scale_error
: Sometimes the data values are very small which gives small fitness values (i.e. errors), making it hard to get any progress. Setting this parameter will multiply the errors by this value.
-
stop_criteria_fitness
: If set (>= 0) then the program will run "forever" (instead of num_evolution
) until fitness is less or equal to the value.
-
show_population
: This shows the whole population in each generation. Mainly for debugging purposes.
-
show_similiar
: Shows all the solutions (programs) with the same fitness value as the best solution. Alternative name: show_similar
.
-
similiar_sort_method
: Method of sorting the similiar solutions when using show_similiar
. Alternative name: similar_sort_method
. Valid options:
-
occurrence
: descending number of occurrences (default)
-
length
: ascending length of solutions
-
show_progression
: boolean. If true then the generation number is shown for all generations when nothing is happening (i.e. no gain in fitness).
-
sample_pct
: (float) Takes a (sample) percentage of the data set if > 0.0.
-
validation_pct
: Withheld a percentage of the test cases for a validation set. This fitness of this validation set is shown.
-
show_all_generations
: Show info of all generations, not just when fitness is changed.
-
hits_criteria
: Criteria of a hit: if the difference is <= this value, it is considered a hit. The number of non-hits is then used as a fitness measure instead of the sum of errors. Setting this function also shows the number of programs which is <= this value.
-
mod_replace
: Setting the replacement value of 0 (zero) for the ModuloIntD
function (see above).
-
showResults
: boolean. If set then all the fitness cases is shown with the output of the fitted program, with difference to the correct values.
-
resultPrecision
: the precision in the output used in showResult
, default 5
-
error_method
: Error method to use. Valid options are
-
totalError
: sum of (absolute) errors (default)
-
minError
: minimum error
-
meanError
: mean error
-
medianError
: median error
-
maxError
: max error
-
no_terminals
: If true then no Terminal is used, i.e. no numbers, just variables. Default false.
-
make_time_series
: Make a time series of the first line of data. The value of num_input_variable
determines the number of laps (+1 for the output variable)
-
make_time_series_with_index
: As make_time_series
with an extra input variable for the index of the series. (Somewhat experimental.)
-
minNodes: value penalty
: minimum number of nodes (terminals + functions). If the number of nodes in a program is less than value
then a penalty of penalty
is added.
-
alldifferent_variables: true/false penalty
: all the variables (terminals) should be different. If there is more than one occurrence of an variable in a program then a penalty of penalty
is added (for each extra variable).
-
ignore_variables
: (TBW) It would be nice to be able to ignore some variables in the data set. But this is yet to be written.
-
return_type
: (TWB) This should be the type of the "main" return value. Note: it is now hard coded in the program as double/DoubleClass
.
Supported function
The program has support for the following functions from JGAP. The "main" type is double so all functions are not applicable there (e.g. IfElse
etc). However, for the ADF functions (defined by setting adf_arity
to > 0) many more functions is supported. Please note that some of these are (very) experimental and maybe don't even make sense in this context.
-
Multiply
(double)
-
Multiply3
(double)
-
Add
(double)
-
Add3
(double)
-
Add4
(double)
-
Divide
(double)
-
Subtract
(double)
-
Sine
(double)
-
ArcSine
(double)
-
Tangent
(double)
-
ArcTangent
(double)
-
Cosine
(double)
-
ArcCosine
(double)
-
Exp
(double)
-
Log
(double)
-
Abs
(double)
-
Pow
(double)
-
Round
(double)
-
Ceil
(double)
-
Floor
(double)
-
Modulo
(double), implements Java's %
operator for double. See ModuloD for a variant
-
Max
(double)
-
Min
(double)
-
LesserThan
(boolean)
-
GreaterThan
(boolean)
-
If
(boolean)
-
IfElse
(boolean)
-
IfDyn
(boolean)
-
Loop
(boolean)
-
Equals
(boolean)
-
ForXLoop
(boolean)
-
ForLoop
(boolean) (cf the double variant ForLoopD)
-
Increment
(boolean)
-
Pop
(boolean)
-
Push
(boolean)
-
And
(boolean), cf the double variant AndD
-
Or
(boolean), cf the double variant OrD
-
Xor
(boolean), cf the double variant XorD
-
Not
(boolean), cf the double variant NotD
-
SubProgram
(boolean)
-
Tupel
(boolean)
Also, see my own defined functions defined in the Java files above.
Examples
Here are two small examples of the program, including the configuration file and a sample run.
Polynom
Here is a simple example of a configuration file. It happens to be the same problem as the JGAP example MathProblem, the polynom x^4 + x^3 + x^2 - x.
#
# Polynom x^4 + x^3 + x^2 - x
# The JGAP example
#
presentation: P(4) x^4 + x^3 + x^2 - x (the JGAP example)
num_input_variables: 1
variable_names: x y
functions: Add,Subtract,Multiply,Divide,Pow,Log,Sine
terminal_range: -10 10
max_init_depth: 4
population_size: 1000
max_crossover_depth: 8
num_evolutions: 800
max_nodes: 20
stop_criteria_fitness: 0.1
data
-2.378099 26.567495
4.153756 382.45743
2.6789956 75.23481
5.336802 986.33777
2.4132318 51.379707
-1.7993588 9.693933
3.9202332 307.8775
2.9227705 103.56364
-0.1422224 0.159982
4.9111285 719.39545
1.2542424 4.76668
1.5987749 11.577456
4.7125554 615.356
-1.1101999 2.493538
-1.7379236 8.631802
3.8303614 282.29697
5.158349 866.7222
3.6650343 239.42934
0.3196721 -0.17437163
-2.3650131 26.014963
A simple run, slightly edited.
It was 20 data rows
Presentation: P(4) x^4 + x^3 + x^2 - x (the JGAP example)
output_variable: y (index: 1)
input variable: x
function1: &1 + &2
function1: &1 - &2
function1: &1 * &2
function1: /
function1: &1 ^ &2
function1: log &1
function1: sine &1
function1: 1.0
[19:52:57] INFO GPGenotype - Creating initial population
[19:52:57] INFO GPGenotype - Mem free: 10.5 MB
[19:52:57] INFO GPPopulation - Prototype program set
[19:52:57] INFO GPGenotype - Mem free after creating population: 10.5 MB
Creating initial population
Mem free: 10.5 MB
Evolving generation 1/800, memory free: 6.7 MB (time from start: 0,42s)
Best solution fitness: 968.56
Best solution: x ^ 4.0
Depth of chrom: 1
Correlation coefficient: 0.9995009838030151
Evolving generation 4/800, memory free: 11.4 MB (time from start: 0,84s)
Best solution fitness: 813.62
Best solution: ((4.0 + 4.0) + (log 4.0)) * ((x ^ 3.0) / (9.0 / x))
Depth of chrom: 3
Correlation coefficient: 0.999500983803015
Evolving generation 6/800, memory free: 6.7 MB (time from start: 1,07s)
Best solution fitness: 712.97
Best solution: ((5.0 * x) + ((x ^ 4.0) + x)) + x
Depth of chrom: 4
Correlation coefficient: 0.999781550858714
Evolving generation 7/800, memory free: 40.3 MB (time from start: 1,20s)
Best solution fitness: 582.77
Best solution: ((9.0 * x) + (x * 9.0)) - ((9.0 - x) - (x ^ 4.0))
Depth of chrom: 3
Correlation coefficient: 0.9965296891627895
Evolving generation 8/800, memory free: 24.9 MB (time from start: 1,32s)
Best solution fitness: 471.58
Best solution: (((x + 7.0) * (x * x)) * x) * (x / 9.0)
Depth of chrom: 4
Correlation coefficient: 0.9980245718601822
Evolving generation 11/800, memory free: 29.6 MB (time from start: 1,69s)
Best solution fitness: 364.73
Best solution: ((x * 8.0) + ((x ^ 4.0) + x)) - (4.0 - ((x * x) * x))
Depth of chrom: 4
Correlation coefficient: 0.9988207761993971
Evolving generation 16/800, memory free: 33.2 MB (time from start: 2,24s)
Best solution fitness: 317.84
Best solution: ((x * ((x * x) + (log 9.0))) * x) + (x * 9.0)
Depth of chrom: 5
Correlation coefficient: 0.9993672319814505
Evolving generation 17/800, memory free: 19.4 MB (time from start: 2,38s)
Best solution fitness: 169.76
Best solution: (x ^ 4.0) + (x * (x * x))
Depth of chrom: 3
Correlation coefficient: 0.9999752402614274
Evolving generation 22/800, memory free: 24.2 MB (time from start: 3,10s)
Best solution fitness: 136.21
Best solution: ((x * x) * x) + (x + (x * (x * (x * x))))
Depth of chrom: 5
Correlation coefficient: 0.9999485732269622
Evolving generation 23/800, memory free: 10.4 MB (time from start: 3,20s)
Best solution fitness: 3.7509724195622374E-4
Best solution: (x * ((((x * x) + x) * x) + x)) - x
Depth of chrom: 6
Correlation coefficient: 0.9999999999999939
Fitness stopping criteria (0.1) reached with fitness 3.7509724195622374E-4 at generation 23
All time best (from generation 23)
Evolving generation 23/800, memory free: 10.4 MB (time from start: 3,20s)
Best solution fitness: 3.7509724195622374E-4
Best solution: (x * ((((x * x) + x) * x) + x)) - x
Depth of chrom: 6
Correlation coefficient: 0.9999999999999939
Total time 3,20s
Fibonacci series
Here is another example, Fibonacci series as time serie. The object is to give a symbolic regression on the fourth value (F4), which is solved quite fast.
#
# Fibonacci with 3 variables
#
presentation: This is the Fibonacci series
return_type: DoubleClass
num_input_variables: 3
variable_names: F1 F2 F3 F4
functions: Multiply,Divide,Add,Subtract
terminal_range: -10 10
max_init_depth: 4
population_size: 20
max_crossover_depth: 8
num_evolutions: 100
max_nodes: 21
show_similiar: true
show_population: false
stop_criteria_fitness: 0
data
1,1,2,3
1,2,3,5
2,3,5,8
3,5,8,13
5,8,13,21
8,13,21,34
13,21,34,55
21,34,55,89
34,55,89,144
55,89,144,233
89,144,233,377
144,233,377,610
233,377,610,987
377,610,987,1597
610,987,1597,2584
987,1597,2584,4181
1597,2584,4181,6765
2584,4181,6765,10946
4181,6765,10946,17711
6765,10946,17711,28657
10946,17711,28657,46368
And a sample run:
Presentation: This is the Fibonacci series
output_variable: F4 (index: 3)
input variable: F1
input variable: F2
input variable: F3
function1: &1 * &2
function1: /
function1: &1 + &2
function1: &1 - &2
function1: -5.0
[19:56:31] INFO GPGenotype - Creating initial population
[19:56:31] INFO GPGenotype - Mem free: 7.5 MB
[19:56:31] INFO GPPopulation - Prototype program set
[19:56:31] INFO GPGenotype - Mem free after creating population: 7.5 MB
Creating initial population
Mem free: 7.5 MB
Evolving generation 1/100, memory free: 6.7 MB (time from start: 0,03s)
Best solution fitness: 17649.74
Best solution: (F3 - (-6.0 - F1)) + ((F2 - F2) - (-6.0 / F1))
Depth of chrom: 3
Correlation coefficient: 0.9999999838289084
Evolving generation 6/100, memory free: 6.0 MB (time from start: 0,13s)
Best solution fitness: 147.0
Best solution: 7.0 + (F2 + F3)
Depth of chrom: 2
Correlation coefficient: 1.0
Evolving generation 7/100, memory free: 5.8 MB (time from start: 0,14s)
Best solution fitness: 0.0
Best solution: F2 + F3
Depth of chrom: 1
Correlation coefficient: 1.0
Fitness stopping criteria (0.0) reached with fitness 0.0 at generation 7
All time best (from generation 7)
Evolving generation 7/100, memory free: 5.8 MB (time from start: 0,15s)
Best solution fitness: 0.0
Best solution: F2 + F3
Depth of chrom: 1
Correlation coefficient: 1.0
Total time 0,15s
All solutions with the best fitness (0.0):
F2 + F3 (1)
Todo
Here are some TODO:s, or things nice to have.
- option for ignoring specific variables
- option for stopping:
- running forever
- after a specific time,
-
when a specific fitness value is reached Fixed (by stop_criteria_fitness
)
-
calculate the number of data rows automatically (i.e. skip num_row) Fixed.
- accept nominal values in the data section; then converted to numeric values.
-
show some progress output. Fixed (by show_progress
)
-
make the output from this program instead from log4j. Fixed.
-
for larger data sets it would be nice with an option to use just
a sample (percentage) of the full data. Fixed (by sample_pct
)
-
Withheld a certain percentage for validation of the best fitted program. Fixed (by validation_pct
)
-
show number of programs less or equal a hits critera Fixed (by hits_criteria
)
-
show info for all generations Fixed (by show_all_generations
)
-
support for user defined test instances Fixed (by ?
in a data row)
-
show number of functions / terminals of the best program Fixed.
- add more fitness metrics.
- better handling of punishing longer solutions (parsimony pressure).
- support for different "main" return classes, i.e. not just DoubleClass
- more statistical measures, e.g. R-squared, mean squared error, mean absolut error, minimum error,
maximum error
- more/better error checks
- more building blocks, a la Eureqa http://ccsl.mae.cornell.edu/eureqa_ops
- support for derivattives (a la Eureqa)?
- incorporate in Weka?
- simplify the best solution with a CAS?
See also
Mention/Usage
Here are some mentions or usage of my Symbolic Regression program:
This page was created by Hakan Kjellerstrand (hakank@bonetmail.com), homepage.