3.3 Significance analysis of epitope amino acids
A discriminant analysis model was used to analyze the relationship between the allergenicity of the epitope peptides (yes or no; Y-variable) and the physicochemical properties of their amino acids (X-variables).
The present study used three descriptors (z1, z2, and z3) to describe the physicochemical properties of the amino acids in the T cell epitopes, and built random forest models with variable importance analysis. On the validation samples, the percentages of correct classifications in the confusion matrices for the seven soybean allergens were 40% (P01070), 57.143% (P04347), 41.667% (P04776), 55.556% (P05406), 72.727% (P11827), 60% (P25974), and 75% (P26987). The importance of each X-variable was determined from the mean decrease in accuracy obtained in the random forest analysis of the quantitative X-variables and the qualitative Y-variable (Fig. 2).
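The idea behind a mean-decrease-in-accuracy score can be illustrated with a small permutation sketch: shuffle one X-variable at a time and record how much the classification accuracy drops. This is a minimal, model-agnostic illustration and not the software or the exact out-of-bag procedure used in the study. The function names, the toy data, and the toy classifier are all hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows whose predicted label equals the true label."""
    hits = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return hits / len(y)

def mean_decrease_accuracy(predict, X, y, n_repeats=10, seed=0):
    """Average accuracy drop per column when that column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: two X-variables, only the first determines the label.
X = [[float(i % 2), float(i % 3)] for i in range(20)]
y = ["yes" if row[0] > 0.5 else "no" for row in X]

def toy_predict(row):
    """Stand-in for a fitted model; it looks at the first variable only."""
    return "yes" if row[0] > 0.5 else "no"

importances = mean_decrease_accuracy(toy_predict, X, y)
```

In this toy case the second variable receives an importance of zero because the stand-in classifier never reads it, whereas shuffling the first variable lowers the accuracy. In practice a fitted random forest would take the place of `toy_predict`.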
As shown in Fig. 2, the variable p1z1 contributed significantly to allergenicity (yes) in five soybean allergens (P01070, P04776, P05406, P11827, and P25974). According to the mean decrease in accuracy values, the variables p2z1, p6z2, and p13z3 contributed to allergenicity (yes) in four soybean allergens. Based on how often each position contributed to allergenicity (yes), p1 and p6 were the most important positions, followed by p2, p4, p5, and p13. For the soybean allergens P01070, P11827, and P25974, the contributing property of the amino acid at the p1 position was bulk, whereas for P04776 and P05046 it was the electronic property. The amino acid at the p6 position contributed to allergenicity (yes) in all allergens except P04776, that is, in six of the seven. Notably, for the allergen P26987, the hydrophobicity, bulk, and electronic property of the amino acid at position p6 all promoted allergenicity (yes).
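The position-level and property-level summaries above amount to splitting each variable name (for example p6z2) into a position and a descriptor and then counting occurrences across allergens. The sketch below shows one way to do that. The helper names and the example input are hypothetical, the allergen labels are placeholders, and the counts are not the study's results.

```python
import re
from collections import Counter

# Descriptor meanings as used in this section.
Z_PROPERTIES = {"z1": "hydrophobicity", "z2": "bulk", "z3": "electronic property"}
VARIABLE_PATTERN = re.compile(r"^p(\d+)(z[123])$")

def parse_variable(name):
    """Split a name such as 'p13z3' into (13, 'electronic property')."""
    match = VARIABLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected variable name: {name!r}")
    return int(match.group(1)), Z_PROPERTIES[match.group(2)]

def tally(important_by_allergen):
    """Count allergens per position, and variables per property."""
    position_counts = Counter()
    property_counts = Counter()
    for variables in important_by_allergen.values():
        parsed = [parse_variable(name) for name in variables]
        # A position counts once per allergen, however many descriptors it has.
        position_counts.update({position for position, _ in parsed})
        property_counts.update(prop for _, prop in parsed)
    return position_counts, property_counts

# Hypothetical input: placeholder allergens with illustrative variable lists.
example = {
    "allergen_A": ["p1z1", "p6z2"],
    "allergen_B": ["p1z1", "p2z1"],
    "allergen_C": ["p6z1", "p6z2", "p13z3"],
}
position_counts, property_counts = tally(example)
```

Counting a position once per allergen is a design choice made here so that an allergen with several important descriptors at the same position does not dominate the ranking. Counting every variable instead would be equally easy.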
For the soybean allergens P04347, P11827, P25974, and P26987, the most important amino acid property for allergenicity (yes) was z1 (hydrophobicity), followed by z2 (bulk) and z3 (electronic property). The electronic property was the most important amino acid property for the allergen P05046, whereas bulk was the most important for the allergens P01070 and P04776.
In addition to the Y-variable allergenicity (yes), the random forest models also related the X-variables to the Y-variable allergenicity (total) in order to identify important variables. Fig. 2 shows that the variable p1z1 affected both allergenicity (no) and allergenicity (yes) in four of the seven allergens (P01070, P04776, P05406, and P25974), and that the variable p6z1 contributed to both allergenicity (no) and allergenicity (yes) in three allergens (P04347, P11827, and P26987).