CAMELLIA (THEACEAE) CLASSIFICATION WITH SUPPORT VECTOR MACHINES BASED ON FRACTAL PARAMETERS AND RED, GREEN, AND BLUE INTENSITY OF LEAVES

Leaf traits are commonly used in plant taxonomic applications. The aim of this study was to test the utility of fractal leaf parameters analysis (FA) and leaf red, green, and blue (RGB) intensity values, combined with support vector machines (SVM), as a method for accurately discriminating Camellia species (68 species from five sections: 11 from sect. Furfuracea, 13 from sect. Paracamellia, 15 from sect. Tuberculata, 24 from sect. Theopsis and 5 from sect. Camellia). The best classification accuracy, 96.88%, was obtained with the RBF SVM classifier (C = 16, g = 0.5). The overall accuracy of the linear kernel was 90.63%, and correct classification rates of 40.63% and 93.75% were achieved by the sigmoid SVM classifier (C = 16, g = 0.5) and the polynomial SVM classifier (C = 16, g = 0.5, d = 2), respectively. A hierarchical dendrogram based on leaf FA and RGB intensity values was mostly in agreement with the generally accepted classification of the Camellia species. SVM combined with FA and RGB may be used for rapidly and accurately classifying Camellia species and identifying unknown genotypes.

* Corresponding author. Email: luhongfei0164@163.com
1 Zhejiang Institute of Subtropical Crops, Zhejiang Academy of Agricultural Sciences, Wenzhou 325005, China.
2 School of Applied Sciences, Health Innovations Research Institute, RMIT University, Melbourne 3000, Victoria, Australia.
DOI: http://dx.doi.org/10.3329/bjpt.v24i1.33034

Introduction
Camellia L. is a commercially important genus of the family Theaceae. It is cultivated globally, particularly in tropical and subtropical regions of East and Southeastern Asia (Ming, 2000; Gao et al., 2005; Lu et al., 2012). Some Camellia species are used to produce tea, others are cultivated as ornamental plants, and the seeds of some species are used for making edible oils (Chen et al., 2005; Vijayan et al., 2009; Jiang et al., 2012). Currently, there are a number of discrepancies in the classification of species from this important genus. The three popular Camellia monographs, by Sealy (1958), Chang (1998) and Ming (2000), differ significantly in species, section and subgenus arrangement. All of these taxonomic classifications are based on morphology. Many studies have shown that classifications based purely on traditional morphological characteristics are insufficient for closely related species, because low divergence leaves too few qualitative features to support the taxonomic systems (Bari et al., 2003; Lu et al., 2008a,b; Pandolfi et al., 2009; Jiang et al., 2010). As a result, there is no agreed method for the classification of Camellia, and further taxonomic research is necessary (Pi et al., 2009).
Leaf characters have been successfully exploited to solve plant taxonomy problems (Plotze et al., 2005; Lin et al., 2008; Ye and Weng, 2011). Traditionally, leaf traits such as shape (Ming, 2000), morphology (Barthlott et al., 2009), and leaf anatomy (Pi et al., 2009; Jiang et al., 2010) have been used for classification. Recently, several researchers have used fractal parameters [...]
In addition, leaf colour information provides useful data for judging the maturity of agricultural products (Gunasekaran et al., 1985), detecting diseases (Howaith et al., 1990), and sorting fruit (Harrell et al., 1989). Thus, leaves provide many characteristics that can be used as a source of data for plant taxonomy (Yang and Lin, 2005).
Supervised techniques are currently among the most effective analysis tools in the classification field. These tools use available information about the category membership of samples to develop a model for classification of the genus. The support vector machine (SVM) is a supervised pattern recognition technique, developed in the machine learning community, that is capable of learning in high-dimensional feature spaces (Cortes and Vapnik, 1995; Lu et al., 2011). The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, which makes the SVM a non-probabilistic binary linear classifier. Recently, SVM has been used in a variety of areas such as information retrieval (Jain et al., 1999), object recognition (Pontil and Verri, 1998), food bruise detection, qualitative assessment of tea (Chen et al., 2008), and fruit classification (Zheng et al., 2010). Chen et al. (2007) demonstrated that SVM fixes the classification decision function by structural risk minimization rather than by minimizing the misclassification error on the training set, which avoids the over-fitting problem. Compared to other pattern recognition tools such as artificial neural networks (ANNs), SVM is a powerful method with a higher training speed and can avoid overtraining (Jack and Nandi, 2002; Kumar et al., 2011). In addition, Burges (1998) suggested that SVM could obtain the best solution for a data set with better generalization ability.
So far, nothing is known about the utility of leaf image analysis and machine learning as a taxonomic toolkit for classification of the genus Camellia. In this study, we combine fractal leaf parameters and leaf red, green, and blue intensity values with SVM to analyze the taxonomic classification of Camellia plants. The main objectives of this work were to (a) develop and evaluate the effectiveness of SVM for identifying 68 species in the genus Camellia, and (b) confirm these relationships based on fractal parameters and red, green, and blue (RGB) intensity values of leaves. Our purpose is to provide a potential tool for accurate classification of Camellia species.

Materials
All plant materials were collected from the International Camellia Garden in Jinhua, Zhejiang Province (29°07′ N, 119°35′ E, 40 m in altitude) in July 2011. All plants share the same environment in this garden, which reduces the major effect of geographical distribution on leaf development. Healthy leaf samples from a total of 68 species, arranged following Chang's (1998) taxonomic treatment (11 species from sect. Furfuracea, 13 species from sect. Paracamellia, 15 species from sect. Tuberculata, 24 species from sect. Theopsis, and five species from sect. Camellia), were examined. The species were split into two groups: 36 for the training phase of SVM model construction and the other 32 for the validation phase (Table 1). All samples were taken from the third mature leaf that was fully exposed to sunlight and horizontally arranged on the two-year-old branches of the plants. At least three plants per species were selected. Means of data were obtained using SAS version 9.0 (SAS Institute, Cary, NC, USA). Voucher specimens for all species were deposited in the Chemistry and Life Science College of Zhejiang Normal University (ZJNU) (see Appendix 1 for voucher details).

Image acquisition and fractal parameters
A Canon EOS 50D camera with a Canon EF-S 18-55 mm f/3.5-5.6 IS lens at 50 mm was used to acquire leaf images. All image acquisition was carried out at least in five and the lighting for images was entirely from natural light on a sunny summer morning. Leaf fractal parameters were calculated using fractal image analysis software (HarFA, Harmonic and Fractal Image Analyzer 5.4) as previously described by Mancuso (2002), Pandolfi et al. (2009) and Zheng et al. (2011). Briefly, Figure 1 shows a schematic diagram of the HarFA output and the five parameters in detail. The basic procedure was as follows: (1) each Camellia leaf image was split into its constituent colour channels (red, green, blue); (2) each channel was thresholded at a colour value between 0 and 255; (3) the fractal dimension (D) for the red, green, and blue channels was calculated by the box counting method; (4) D, as a function of the thresholding condition, was plotted against the colour intensity to obtain the fractal spectra of the three channels; (5) the baseline (D = 1) that separates the fractal (D > 1) from the non-fractal (D < 1) zone of the spectrum was determined (for this study, we selected D = 1.2 as the baseline); (6) finally, the five fractal parameters (X1, X2, X, Y, and S) were determined using Origin Lab (version 8.0). Additionally, average RGB intensity values from the Camellia images were assessed using the colour histogram tool of ImageJ (National Institutes of Health, Bethesda, MD).
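Steps (2) to (4) of the procedure can be illustrated with a minimal Python sketch of box counting over a range of thresholds. This is not HarFA's implementation: the function names, the box sizes, and the rule that a pixel counts as foreground when its intensity is at or above the threshold are all assumptions made for illustration.

```python
import math

def box_count(mask, size):
    """Count size x size boxes that contain at least one foreground pixel."""
    occupied = set()
    for r, row in enumerate(mask):
        for c, on in enumerate(row):
            if on:
                occupied.add((r // size, c // size))
    return len(occupied)

def box_counting_dimension(mask, sizes=(1, 2, 4, 8, 16)):
    """Estimate D as the negative least-squares slope of log N(s) vs log s."""
    points = []
    for s in sizes:
        n = box_count(mask, s)
        if n > 0:
            points.append((math.log(s), math.log(n)))
    if len(points) < 2:
        return 0.0  # empty mask: no structure to measure
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx

def fractal_spectrum(channel, thresholds=range(0, 256, 5)):
    """Return (threshold, D) pairs for one colour channel (2D list, 0-255).

    Assumption: a pixel is foreground when its intensity >= threshold.
    """
    spectrum = []
    for t in thresholds:
        mask = [[v >= t for v in row] for row in channel]
        spectrum.append((t, box_counting_dimension(mask)))
    return spectrum
```

As a sanity check on the estimator, a completely filled square yields D close to 2 and a single straight line of pixels yields D close to 1, which matches the interpretation of the baseline in step (5).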

Cluster analysis
Cluster analysis groups data into similar and dissimilar groups based on the attributes of a given population. We used it to classify the 68 species of the genus Camellia based on the 15 fractal parameters and the average RGB intensity values of the leaves, and compared the outcome with Chang's (1998) results. A hierarchical dendrogram was constructed using the Unweighted Pair-Group Method with Arithmetic Mean (UPGMA). The Gower General Similarity Coefficient was applied to address multi-dimensional scaling. The Multivariate Statistical Package (Version 3.13n, Kovach Computing Services) was used to conduct the cluster analysis.
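The clustering step can be sketched in Python as follows. This is an illustrative re-implementation under stated assumptions, not the procedure of the statistical package used in the study: the Gower distance is taken as one minus the Gower similarity for purely quantitative variables (range-normalised absolute differences averaged over variables), and the function names are hypothetical.

```python
def gower_distance_matrix(rows):
    """Gower distance (1 - similarity) for quantitative variables only."""
    n, p = len(rows), len(rows[0])
    ranges = [max(r[k] for r in rows) - min(r[k] for r in rows)
              for k in range(p)]
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Variables with zero range contribute no difference.
            total = sum(abs(rows[i][k] - rows[j][k]) / ranges[k]
                        for k in range(p) if ranges[k] > 0)
            dist[i][j] = dist[j][i] = total / p
    return dist

def upgma(dist, labels):
    """Average-linkage clustering; returns merges as (left, right, height)."""
    n = len(labels)
    clusters = {i: (labels[i],) for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merges.append((clusters[a], clusters[b], height))
        new_d = {}
        for c in clusters:
            if c in (a, b):
                continue
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted mean keeps every original pair equally weighted.
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = merged
        next_id += 1
    return merges
```

Each row passed to `gower_distance_matrix` would hold the variables for one species, and the returned merge list carries the information needed to draw a dendrogram.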

SVM analysis
The support vector machine (SVM) was first proposed for pattern recognition applications by Vapnik (1995) based on statistical learning theory. The classification mechanism of SVM can be described simply: SVM tries to create a boundary (hyperplane) that meets the requirements of classification, such that the distance between the boundary and the nearest data points (support vectors) is maximal while the classification precision is maintained. Theoretically, SVM can realize the optimal classification of linearly separable data. To solve non-linear problems, SVM maps the data from a low-dimensional input space to a high-dimensional feature space through a transformation function (kernel function).
All SVM algorithms were implemented with LIBSVM (Version 3.0) under MATLAB (The Mathworks, Inc., Natick, MA, USA, version 7.9 R2009b). LIBSVM is a library for support vector machines (2001).
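For reference, the four kernel types compared in this study can be written out as plain Python functions following the kernel definitions documented for LIBSVM. The sketch is for illustration only; the offset `coef0` is not reported in the text, so the value 0 used as the default here is an assumption.

```python
import math

def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return dot(u, v)

def polynomial_kernel(u, v, gamma, coef0=0.0, degree=2):
    """K(u, v) = (gamma * u . v + coef0) ** degree (coef0 = 0 is assumed)."""
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(u, v, gamma, coef0=0.0):
    """K(u, v) = tanh(gamma * u . v + coef0) (coef0 = 0 is assumed)."""
    return math.tanh(gamma * dot(u, v) + coef0)
```

With `degree=1` and `coef0=0` the polynomial kernel reduces to a scaled linear kernel, and the RBF kernel of any vector with itself is always 1.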

The fractal dimension and RGB intensity values of species
As shown in the flow chart (Fig. 1), for each species, the five fractal parameters (X1, X2, X, Y, S) were derived from the fractal spectra of each colour channel (red, green, and blue), giving 15 variables. The fractal values obtained for different Camellia species belonging to sections Furfuracea, Paracamellia, Tuberculata, Theopsis, and Camellia are shown in Figs. 2-4. The RGB intensity values are shown in Table 2. Thus, 18 input variables were obtained for modeling.

Unsupervised cluster analysis
The relationship between the 68 Camellia species was examined by constructing a dissimilarity dendrogram using the 18 variables described above (Fig. 5). The species classified under sect. Theopsis by Chang (1998) clustered together (numbers 40 to 63) in the current study. Further, species numbers 12 to 24 and number 28 grouped together as an independent branch, which is also mostly congruent with Chang's treatment of sect. Paracamellia. Species numbers 1 to 11, belonging to sect. Furfuracea according to Chang's taxonomy, also clustered together. However, two species from sect. Tuberculata, viz. C. tuberculata and C. obovatifolia, also clustered with them. The other sect. Tuberculata species clustered together, apart from C. rhytidophylla, which clustered with sect. Paracamellia. Finally, species from sect. Camellia clustered together, apart from C. xiafongensis, which clustered with sect. Theopsis.

Support vector machine (SVM) classification accuracy
The training set and test set of the SVM model are presented in Table 1. Class designation is important for training SVM algorithms. The 68 species analyzed in the current study were divided into five categories, with the class designation following Chang's (1998) taxonomy. Two SVM parameters, the regularization parameter (C) and the kernel parameter (g), are key to good model performance and were optimized by cross-validation. In the current work, log2 C and log2 g were varied from -5 to 5 in increments of 0.5. As seen in Fig. 6, the highest average accuracy of 83.33% was achieved for the training data set when C = 16 and g = 0.5. The parameters of the polynomial SVM combined an additional polynomial degree (d) with C and g. The classification results of the linear, radial basis function (RBF), and sigmoid SVM models with optimal parameters of C and g are presented in Fig. 7. The RBF SVM classifier gave the best performance, with an overall accuracy of 96.88%. The linear SVM classifier for the five sections showed a correct classification rate of 90.63% (sect. Furfuracea, 100%; sect. Paracamellia, 100%; sect. Tuberculata, 57.14%; sect. Theopsis, 100%; sect. Camellia, 100%), but the overall accuracy of the sigmoid kernel on the test data set was worse than that of any other classifier, at only 40.63% (sect. Furfuracea, 0%; sect. Paracamellia, 33.33%; sect. Tuberculata, 0%; sect. Theopsis, 91.67%; sect. Camellia, 0%). A polynomial classifier with degree d = 1 is in fact a linear classifier. The classification results of the polynomial SVM classifier with degrees from 2 to 9 are shown in Table 4. The polynomial SVM classifier with d = 2 achieved the best overall classification accuracy (93.75%) for the five sections (sect. Furfuracea, 100%; sect. Paracamellia, 100%; sect. Tuberculata, 71.43%; sect. Theopsis, 100%; sect. Camellia, 100%).
In addition, increasing the polynomial degree beyond 2 brought no benefit: the classification accuracies showed a descending trend as d increased (Table 4).
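The parameter search described at the start of this section can be sketched as a plain grid search in Python. The grid bounds and step follow the text; the `score` callable is hypothetical and stands in for whatever cross-validation routine returns the average accuracy for a given pair (C, g), since the number of folds is not stated.

```python
def log2_grid(low=-5.0, high=5.0, step=0.5):
    """Exponents from low to high inclusive, e.g. -5, -4.5, ..., 5."""
    count = int(round((high - low) / step))
    return [low + i * step for i in range(count + 1)]

def grid_search(score, exponents=None):
    """Return (best_accuracy, C, g) over all combinations on the grid.

    score(C, g) is a hypothetical callable returning cross-validated
    accuracy; ties keep the first combination encountered.
    """
    if exponents is None:
        exponents = log2_grid()
    best = None
    for exp_c in exponents:
        for exp_g in exponents:
            C, g = 2.0 ** exp_c, 2.0 ** exp_g
            accuracy = score(C, g)
            if best is None or accuracy > best[0]:
                best = (accuracy, C, g)
    return best
```

The default grid contains 21 exponents per parameter, so 441 combinations are evaluated. The reported optimum of C = 16 and g = 0.5 corresponds to log2 C = 4 and log2 g = -1, both of which lie on this grid.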

Discussion
Plant numerical taxonomy applies numerical methods or supervised techniques such as SVM to the classification of taxonomic units. It converts the information content of taxa into numerical quantities, and its aim is objectivity. Thus, developing a taxonomic toolkit is becoming an indispensable aid in modern systematics. Traditionally, leaf characters have been used as a basis for plant taxonomy, and they have been successfully used to solve plant classification problems (Linnaeus, 1753). Contemporary classification, especially for the genus Camellia, has involved the use of advanced technology tools. Some examples are classification at the genus level based on a simulated annealing aided cloud classifier (Pi et al., 2011); the use of genetic information with molecular biotechnology tools; and Fourier transform infrared spectroscopy (FTIR) combined with shape and anatomy analysis of Camellia leaves (Lu et al., 2008b; Shen et al., 2008), which suggested that the chemical method also had important taxonomic significance. However, as shown in Table 5, some of these methods are laborious and expensive, and do not always guarantee satisfactory results. Moreover, a defect common to all of these approaches (Table 5) is that they obtain quantitative plant features by damaging the leaves. In contrast, fractal analysis and RGB intensity values combined with the support vector machine (SVM), as used in our study, are non-destructive, simple, and easily performed. The fractal spectrum was introduced as a botanical identification key by Mugnai et al. (2008). Leaf colour is a very distinctive characteristic but is often ignored by taxonomists. Camellia species include both trees and shrubs, and plant height and leaf features may interfere with plant photosynthesis. The chlorophyll content in turn is correlated with the leaf colour. Moreover, the long-term evolution of Camellia species has made them a stable system; therefore, they can be classified based on leaf traits such as chlorophyll content.
Chang (1998) and Ming (2000) are two comprehensive floras prominently used by Camellia researchers. People often turn to a flora to identify a new species; however, traditional information retrieval processes are frequently cumbersome. Further, some basic characteristics can only be identified manually, which requires experience and is often subjective. These limitations can be overcome by developing an automated method of plant identification that is rapid and efficient. We have developed an automated method using leaf fractal parameters in an SVM model to classify 68 Camellia species. The taxonomic results are very encouraging, with an accuracy of up to 96.88% using the RBF kernel. As a modern pattern recognition tool, the SVM has advantages over other methods such as the back-propagation artificial neural network (BP-ANN). The common problem with neural networks is the network structure; BP-ANN may suffer from over-fitting because its approach is based on the empirical risk minimization principle. By comparison, over-fitting can be easily controlled in SVM by choosing a suitable margin to obtain the best solution for the entire data set (Burges, 1998). In addition, SVM does not need a large training set to develop a model.
Our results were mostly congruent with Chang's (1998) classification of Camellia species, with some differences. However, it should be noted that other researchers have also reported deviations from Chang's classification. For example, when our results are compared to the Camellia classification by Vijayan et al. (2009), the general agreement in classification of the 68 Camellia species indicates the usefulness of fractal parameters and RGB intensity in detecting phylogenetic relationships. For the plants from sect. Furfuracea and sect. Theopsis, all collected species from each of the two sections were joined and intermixed (Fig. 5), which is in agreement with the classification by Vijayan et al. (2009). In addition, our results support the grouping of C. yuhsienensis (No. 13, from sect. Paracamellia) and C. rhytidophylla (No. 28, from sect. Tuberculata) together, as reported by Vijayan et al. (2009). This, however, differs from Chang's (1998) treatment of these two species. Further, as shown in Fig. 5, species from sect. Paracamellia grouped together, whilst the taxonomic treatment of Vijayan et al. (2009) places these species in three clades.
In analyzing results from the SVM classifiers, we found that species number 12 (C. grijsii) from sect. Paracamellia was incorrectly classified as a species from sect. Furfuracea by all SVM classifiers [linear, RBF, sigmoid, and polynomial (d = 2) classifiers]. This deviation needs further investigation to determine whether the misclassification is due to the underlying algorithm's fitting of the data, or whether C. grijsii really has a close relationship with sect. Furfuracea.
In addition, high quality seeds are key to developing modern agriculture, and it is necessary to select good seed varieties to improve crop yield. An elite variety with greater benefits should replace a variety with inferior quality seeds. Bacchetta et al. (2011) identified Sardinian species of Astragalus section Melanocercis by seed image analysis. Developing countries still use traditional manual seed separation methods. In this context, the application of SVM based on fractal leaf parameters analysis (FA) and leaf red, green, and blue (RGB) intensity values used in the present study is proposed not only as a complementary method for botanical identification, but also as a modern method for selecting good seed. The SVM-FA-RGB system is very simple to establish and requires only a personal computer and an optical scanner. Therefore, it could potentially replace old methods that are complicated, labour-intensive and expensive.

Conclusion
We have developed a system for automatic classification of 68 Camellia species into five sections based on SVM and discussed the important features of this classification. The hierarchical dendrogram based on fractal parameters and RGB intensity values mostly confirms the morphological classification of the five sections proposed by Chang (1998). The linear, polynomial (d = 2), and RBF SVM classifiers with C = 16 and g = 0.5 work well in the classification of the genus Camellia. The RBF SVM classifier in particular showed encouraging results, achieving a correct classification rate of 96.88%. These results indicate that analysis of fractal parameters and RGB intensity values using SVM, particularly with the RBF kernel, can be effectively used to distinguish Camellia at the genus level, or even at higher taxonomic levels. In addition, the SVM-FA-RGB system could be used to select high quality seeds in agricultural breeding programs.