It does not include redundant records in the train
set, so the classifiers will not be biased towards more frequent records.
There are no duplicate records in the proposed test
sets; therefore, the performance of the learners is not biased by the methods which
have better detection rates on the frequent records.
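The effect of removing redundant records can be illustrated with a minimal sketch. The record layout and values below are hypothetical toy data, not taken from the KDD data set, and the helper name `drop_duplicate_records` is an assumption introduced only for illustration.

```python
def drop_duplicate_records(records):
    """Return the records with exact duplicates removed.

    The first occurrence of each record is kept and the original order
    is preserved. Records are compared field by field as tuples.
    """
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical toy records: (duration, protocol, service, label).
toy_records = [
    (0, "tcp", "http", "normal"),
    (0, "icmp", "ecr_i", "smurf"),
    (0, "icmp", "ecr_i", "smurf"),  # exact repeat of the previous record
    (0, "icmp", "ecr_i", "smurf"),  # exact repeat of the previous record
    (2, "tcp", "ftp", "normal"),
]

unique_records = drop_duplicate_records(toy_records)
```

In this toy example the repeated record appears three times in the input but only once in `unique_records`, so a learner trained or scored on the result is no longer rewarded three times for handling the same record.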
The number of selected records from each difficulty-level
group is inversely proportional to the percentage of records in the original KDD
data set. As a result, the classification rates of distinct machine learning methods
vary over a wider range, which makes it easier to evaluate different learning
techniques accurately.
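The selection rule above can be sketched as follows. This is only one possible reading, stated here as an assumption: a group that holds a share p of all records keeps a fraction 1 - p of its members, so the larger a group's share, the smaller the fraction retained. The function name `sample_by_difficulty`, the `group_of` callback, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_by_difficulty(records, group_of, seed=0):
    """Sample records so that over-represented groups are thinned out.

    Assumption: a group holding a share p of all records keeps a
    fraction (1 - p) of its members. `group_of` maps a record to its
    difficulty-level group.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[group_of(rec)].append(rec)

    total = len(records)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = []
    for members in groups.values():
        # len(members) * (1 - len(members) / total), written to avoid
        # floating-point subtraction.
        keep = round(len(members) * (total - len(members)) / total)
        sample.extend(rng.sample(members, keep))
    return sample


# Hypothetical toy data: 90 "easy" records and 10 "hard" records.
toy_records = [("easy", i) for i in range(90)] + [("hard", i) for i in range(10)]
sample = sample_by_difficulty(toy_records, group_of=lambda rec: rec[0])
```

With this toy input, the easy group (90% of the records) keeps 10% of its members and the hard group (10% of the records) keeps 90% of its members, so each contributes 9 records to the sample and the hard records are no longer swamped by the easy ones.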
The number of records in the train and test sets
is reasonable, which makes it affordable to run the experiments on the complete
set without the need to randomly select a small portion. Consequently, evaluation
results of different research works will be consistent and comparable.