Cite this as
Herrero R, Martinez-Diaz Y, Mendez-Vazquez H, Nieves J, Gonzalez A (2023) Face morphometric profiles of groups as early markers for certain diseases? Int J Oral Craniofac Sci 9(2): 008-015. DOI: 10.17352/2455-4634.000060
Copyright License
© 2023 Herrero R, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Background: Face morphometry has been shown to work as a diagnosis tool in a set of syndromes. Face similarities are usually indications of more complete genetic similarities.
Purpose: To show preliminary results on the face morphometry profile of the Cuban population and to argue that it could be used to define early markers for diseases like Alzheimer’s.
Methods: A dataset composed of photos of 200000 men is processed. Facial landmarks are extracted by means of the DLIB library and distances between them are computed. By clustering samples with similar facial traits, groups are formed and their densities inside the population are computed.
Results: The face morphometry profiles for two age cohorts are obtained, showing the population dynamics. Genes involved in facial development are shown to be related to Alzheimer’s disease.
Conclusion: Late multifactorial diseases develop against the genetic background of each individual, which is expressed in that individual's face morphometry. The latter can thus be considered a risk marker.
AD: Alzheimer’s Disease; FM: Face Morphometry; PCA: Principal Component Analysis; TCGA: The Cancer Genome Atlas
Population-wide studies, such as infant height and weight tables [1,2], are useful tools for the early detection of metabolic disorders or any other kind of disruption of normal state in children. By comparing the infant’s parameters with the reference values, we get a first signal of a possible disorder and may proceed to further studies. At present, when the molecular approach to medicine is prevailing more and more in clinical analyses [3-6], genetic and epigenetic markers [7-9] and their reference values in a population become extremely important in many diseases.
However, genetic studies are still relatively expensive [10-12]. In certain situations, in particular the “trivial” case of Down’s syndrome, a direct examination of the face is enough for diagnosis. Less well known, but similarly strong as a marker, is the diagnosis of Down’s syndrome from fingerprint patterns [13]. Both markers, face characteristics and fingerprints, are easily obtained. In a sense, they are indirect markers that integrate genetic and epigenetic information [14]. As examples of papers using Face Morphometry (FM) for the diagnosis of a set of syndromes, we may cite Refs. [15-19].
Instead of attempting a diagnosis from indirect markers, we may try to define risk groups inside the population. It is well known that, in many diseases, the risk has an ethnic or group component [20-22]. Consider, for example, the cancer risk data from 423 cancer registries around the world (Ref. [23], Supplementary information). A Principal Component Analysis (PCA) [24,25] of the data shows that ethnic and cultural groups exhibit distinct patterns, as is apparent in Figure 1.
For a given population, in particular the Cuban population, mixing between groups could obscure ethnic origins or other factors. A more detailed analysis based on FM measurements, for example, could be a valuable tool to identify groups inside the population and, indirectly, multivariate markers for predisposition to certain diseases.
We performed a preliminary analysis of FM data of the Cuban population. A group of 200,000 randomly chosen men was studied. The data we received contain only a set of two-dimensional vectors (face landmarks) obtained with the widely used Dlib library [26-30]. Additionally, popular Python data analysis libraries such as NumPy [31], Pandas [32,33], Matplotlib [34], Scikit-Learn [35] and MLxtend [36] were used.
As mentioned, a dataset of 200,000 randomly selected facial images of Cuban men was processed by means of the DLIB software. The images correspond to two age cohorts: the first comprises men born between 1940 and 1960, and the second men born between 1961 and 1980. Racial information was also available for each image, but was not used.
The landmark localization model failed to extract facial landmarks for 1,968 images in the 40 - 60 cohort (men born between 1940 and 1960) and 470 images in the 61 - 80 cohort (men born between 1961 and 1980). Combined, these 2,438 detection failures account for about 1.2% of the total image dataset. As such, the missing landmark data is not expected to substantially impact downstream analyses.
Facial landmark localization was performed using the 68-point face model from the DLIB library. These points are shown in the top panel of Figure 2 [37]. After extracting the landmark coordinates from each image, the landmarks were filtered to keep only those that represent relevant facial features and reflect the underlying osseous structure. The selected points are shown in the bottom panel of Figure 2 on an image from the public UTKFace dataset (https://susanqq.github.io/UTKFace/, [38]). The landmarks that best reflect the osseous structure are those most stable against variations in facial expression. Consider, for example, the eye landmarks: point 38 varies with the openness of the eye, while point 39 maintains its position despite such variations. Selecting stable points according to this criterion reduces noise and improves the reproducibility of results across different images of the same person.
Based on these stability criteria, the selected DLIB landmarks were: 0, 4, 7, 8, 9, 12, 16, 17, 19, 21, 22, 24, 26, 27, 30, 31, 33, 35, 36, 39, 42, 45, 48, 51, 54, 57, for a total of 26 points per image.
Let us stress that our script for inferring landmark coordinates was tested on public databases. From a national database we received only an anonymized set of numbers (the coordinates of the 68 Dlib landmarks), which respects the confidentiality of the image data and makes a kind of epidemiological study possible.
To ensure that all facial images exhibit a frontal pose and neutral expression, two filtering criteria were imposed. First, the frontal pose was quantified through a symmetry coefficient defined as s = distance(0,8) / distance(16,8), requiring that 0.90 < s < 1.10, where distance(0,8), for example, denotes the distance between landmarks 0 and 8.
Second, mouth closure was assessed via a coefficient given by m = y66 / y62, requiring that m < 1.02. Here, y62 and y66 are the vertical coordinates of the central upper and lower lip landmarks 62 and 66. The closer these two landmarks lie to each other, the more closed the mouth. Notice that these landmarks are used only for filtering purposes.
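The two filtering criteria can be sketched in a few lines of Python. This is only an illustrative sketch, not the script used in the study: the function names are hypothetical, and it assumes that each face is given as a sequence of 68 (x, y) pairs indexed as in the DLIB model, in image coordinates where y grows downward.

```python
import math

def symmetry_coefficient(landmarks):
    """s = distance(0, 8) / distance(16, 8), as defined in the text."""
    return math.dist(landmarks[0], landmarks[8]) / math.dist(landmarks[16], landmarks[8])

def mouth_coefficient(landmarks):
    """m = y66 / y62, the ratio of the vertical coordinates of landmarks 66 and 62."""
    return landmarks[66][1] / landmarks[62][1]

def passes_filters(landmarks, s_min=0.90, s_max=1.10, m_max=1.02):
    """True if the face is frontal (s_min < s < s_max) and the mouth is closed (m < m_max)."""
    s = symmetry_coefficient(landmarks)
    m = mouth_coefficient(landmarks)
    return s_min < s < s_max and m < m_max
```

A face whose chin (point 8) is equidistant from the jaw end points 0 and 16 gives s = 1 and passes the first criterion; turning the head shifts point 8 toward one side and moves s away from 1.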
Rather than absolute facial landmark coordinates, the distances between landmarks are more informative. Two highly stable landmarks were chosen as reference points, with all other points defined by their normalized distances to these references. Specifically, the outer eye corners (points 36 and 45) served as the reference landmarks in this study, with the distance between them constituting the reference distance for a given image. Each remaining landmark distance was normalized to this image-specific reference distance, resulting in comparable measurements across the dataset.
To reduce dimensionality, symmetric and homologous landmark distances were consolidated by averaging. Symmetric distances were defined as those between landmarks along the mid-sagittal plane (points 8, 27, 30, 33, 51, and 57) and each of the references (e.g. d_{8-36} and d_{8-45}).
Equivalent distances were defined as those between symmetrical counterpart landmarks across the mid-sagittal facial plane and the reference landmark on the opposite side of the face (e.g. d_{12-36} and d_{4-45}). Equivalent landmark pairs consisted of (0, 16), (4, 12), (7, 9), (17, 26), (19, 24), (21, 22), (31, 35), (39, 42) and (48, 54). After averaging symmetric and equivalent distances, each facial image was characterized by a 24-dimensional distance vector.
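The construction of the distance vector can be sketched as follows. The sketch rests on an assumption that is not spelled out in the text: to arrive at 24 components, each equivalent pair is taken to contribute two averaged distances, one to the opposite-side reference (as in the example d_{12-36} and d_{4-45}) and one to the same-side reference. Together with the 6 mid-sagittal points this gives 6 + 2 x 9 = 24. The function name and the ordering of the components are hypothetical.

```python
import math

REFS = (36, 45)                      # outer eye corners used as reference landmarks
MIDLINE = (8, 27, 30, 33, 51, 57)    # landmarks on the mid-sagittal plane
PAIRS = ((0, 16), (4, 12), (7, 9), (17, 26), (19, 24),
         (21, 22), (31, 35), (39, 42), (48, 54))

def distance_vector(landmarks):
    """Return 24 normalized, averaged distances for one face.

    landmarks: sequence of 68 (x, y) pairs indexed as in the DLIB model.
    """
    ref_a, ref_b = REFS
    ref_dist = math.dist(landmarks[ref_a], landmarks[ref_b])

    def d(i, j):
        # distance between landmarks i and j in units of the reference distance
        return math.dist(landmarks[i], landmarks[j]) / ref_dist

    features = []
    for p in MIDLINE:
        # symmetric distances: average of d_{p-36} and d_{p-45}
        features.append((d(p, ref_a) + d(p, ref_b)) / 2)
    for a, b in PAIRS:
        # a is assumed to lie on the side of landmark 36, b on the side of 45
        features.append((d(a, ref_b) + d(b, ref_a)) / 2)  # opposite-side reference
        features.append((d(a, ref_a) + d(b, ref_b)) / 2)  # same-side reference (assumption)
    return features
```

Because every distance is divided by the image-specific reference distance, rescaling an image leaves the vector unchanged.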
To reduce dimensionality and decorrelate the data, principal component analysis (PCA) was performed on the distance data. The facial data samples were then projected onto the first 5 principal components, which account for nearly 90% of the variance. The vectors defining the principal components for the 40 - 60 years age cohort are used in the 61 - 80 cohort as well in order to compare them. The space was divided into equal-width intervals. The number of sub-divisions in each component is taken roughly proportional to its variance. Specifically, the first principal component was partitioned into 5 intervals and the second into 3 intervals, resulting in a total of 15 cells partitioning the PC1 vs. PC2 plane, and thus the whole space. Samples are counted in each cell in order to determine the incidence per 1000 individuals and construct the 2D density map.
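The projection and cell-counting steps can be sketched with NumPy alone. This is a hedged illustration rather than the pipeline used in the study: PCA is computed here through a singular value decomposition instead of Scikit-Learn, the function names are hypothetical, and the equal-width intervals are assumed to span the range of the reference cohort, with points of the second cohort outside that range assigned to the outermost cells.

```python
import numpy as np

def fit_pca(X, n_components=5):
    """Return the mean and the first principal axes of X (samples in rows)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Project samples onto previously fitted principal axes."""
    return (X - mean) @ components.T

def equal_width_edges(values, n_bins):
    """Edges of n_bins equal-width intervals covering the range of values."""
    return np.linspace(values.min(), values.max(), n_bins + 1)

def cell_density_per_1000(scores, edges1, edges2):
    """Count samples in each PC1 x PC2 cell and scale to counts per 1000."""
    n1, n2 = len(edges1) - 1, len(edges2) - 1
    # digitizing against the interior edges yields indices 0..n-1;
    # values outside the range fall into the first or last interval
    i = np.digitize(scores[:, 0], edges1[1:-1])
    j = np.digitize(scores[:, 1], edges2[1:-1])
    counts = np.bincount(i * n2 + j, minlength=n1 * n2)
    return 1000.0 * counts / len(scores)
```

In use, `fit_pca` and `equal_width_edges` would be applied to the first cohort only (5 intervals for PC1, 3 for PC2), and the resulting axes and edges reused for the second cohort so that the two density maps share the same 15 cells.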
As mentioned above, we performed a preliminary analysis of FM data of the Cuban population. To the best of our knowledge, this is the first FM study in a population. Two cohorts of men were studied. The data we received contain only a set of two-dimensional vectors (face landmarks) obtained with the widely used Dlib library. We selected 26 landmarks based on their importance in the underlying osseous structure and their stability with respect to changes in facial expression. From pairs of face points in a normalized image we compute distances. In the end, each original photo yields a vector of 24 distances.
The distance data is processed by means of PCA. The first 5 components are shown to account for nearly 90 % of data variance. We restrict ourselves to these 5 components and divide the PCA space into 15 cells corresponding to well-defined groups of similar face characteristics. The results are shown in (Figure 3) in the form of a histogram comparing the two cohorts. The x-axis is a number labeling the cell, whereas the y-axis is the population density per 1000 inhabitants. That is, if the column height for a given cell is 300, for example, there are 300 men with these facial characteristics per 1000 inhabitants.
We notice that there are measurable differences between the cohorts in spite of the relatively small time lapse between them, 20 years. We checked by a kind of bootstrap analysis that, in the cells with large differences, the error bars are small compared with the differences, so the latter are not artifacts. Our hypothesis is that the differences signal a change in the mixing dynamics inside the population after the Cuban revolution of 1959, with the abolishment of racial prejudices. As mentioned, these results are preliminary and should be further confirmed.
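The text does not specify the bootstrap procedure. One common variant, given here purely as an assumption, resamples the individuals of a cohort with replacement and recomputes the cell densities each time; the spread over resamples serves as the error bar. The function name and default parameters are hypothetical.

```python
import numpy as np

def bootstrap_density_error(cell_labels, n_cells=15, n_boot=200, seed=0):
    """Mean and standard deviation of the per-1000 cell densities over resamples.

    cell_labels: integer array with the cell index (0..n_cells-1) of each individual.
    """
    rng = np.random.default_rng(seed)
    n = len(cell_labels)
    densities = np.empty((n_boot, n_cells))
    for b in range(n_boot):
        # resample individuals with replacement and recount the cells
        sample = rng.choice(cell_labels, size=n, replace=True)
        densities[b] = 1000.0 * np.bincount(sample, minlength=n_cells) / n
    return densities.mean(axis=0), densities.std(axis=0, ddof=1)
```

A difference between cohorts in a given cell would then be regarded as robust when it is several times larger than the error bars of both cohorts in that cell.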
A figure like Figure 3, comparing the 40 - 60 cohort with data for AD patients born in the same time interval, for example, would indicate whether there is some population group with higher or lower risk for this disease. The null hypothesis is that AD patients are randomly distributed inside the population. Thus, the density of AD patients inside a given cell should be proportional to the cell density, with the same proportionality constant for all cells. Deviations in a given cell would indicate an increased (or decreased) risk. Of course, the number of AD patients should be high enough to get relevant statistics. Population-wide studies are needed. A preliminary study is already running in Cuba [39]. On the other hand, as morphometric characteristics are more related to osseous structure than to expression or fat in the face, such indications may be used as early risk markers for younger people in the same morphometric groups.
It is worth discussing recent findings on the genetics of face and brain shape [40]. Observed correlations are only between FM and brain shape, not affecting behavioral-cognitive traits, in particular AD. One could rephrase these results in a simplified way as the absence of genetic correlations between face morphometry and early AD.
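The comparison against the null hypothesis can be sketched as follows. Under the null hypothesis the expected number of AD patients in a cell is the total number of patients times the population fraction of that cell; the ratio of observed to expected counts then plays the role of a relative risk. The function name and the example numbers are hypothetical and are not data from the study.

```python
def relative_risk_by_cell(ad_counts, population_density_per_1000):
    """Compare observed AD counts per cell with the null-hypothesis expectation.

    Returns the expected counts, the observed/expected ratios and the
    Pearson chi-square statistic summed over cells.
    """
    total_ad = sum(ad_counts)
    total_density = sum(population_density_per_1000)
    expected = [total_ad * d / total_density for d in population_density_per_1000]
    ratios = [o / e if e > 0 else float("nan") for o, e in zip(ad_counts, expected)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(ad_counts, expected) if e > 0)
    return expected, ratios, chi2

# Hypothetical illustration with three cells only
expected, ratios, chi2 = relative_risk_by_cell([40, 30, 30], [500, 300, 200])
```

Ratios above 1 would point to cells with increased risk and ratios below 1 to cells with decreased risk. Under the null hypothesis, and for sufficiently large expected counts, the statistic approximately follows a chi-square distribution with one degree of freedom fewer than the number of cells.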
However, AD is mostly a disease of the elderly [41], essentially multifactorial. There are even hypotheses of infectious origin [42]. These multiple causes evolve against a given genetic background, which may accelerate or slow down the progression towards AD. By measuring correlations between FM and AD we would point out predisposition to AD in later stages of life.
Consider, for example, the more limited set of 51 genes identified in Ref. [43] as involved in facial development. We list them in Table 1. Of these, 16 genes have been related to AD. We indicate in the table the links to the relevant literature. Baseline alterations of these genes could provide a background for the evolution towards AD in later stages of life.
A recent surprising result [14] states that facial similarities are indications of more complete genetic similarities among individuals. Thus, our FM groups are probably groups of people with very similar genetic backgrounds, against which factors leading to AD evolve.
The risk for Parkinson’s disease or certain cancers could also be correlated to FM. We also indicate in Table 1 that 21 of these genes are related to cancer, according to Genecards [44]. An example is the Tumor Protein P63 gene (TP63), a member of the P53 family of transcription factors. At the mutational level, there is no correlation to cancer. Probably, a statement like that of Ref. [40] may be formulated: no direct genetic correlation between FM and early cancer. However, cancer is also a multifactorial disease of the elderly, and the baseline expression level of mutated TP63 could be a factor facilitating evolution toward prostate cancer, for example. Indeed, we identified TP63 among the 33 most important expression markers in prostate cancer [45].
In order to check whether the expression levels of the set of 51 genes play a role in defining the tumor state, we performed PCA calculations on TCGA expression data [46] for a set of tissues. Only the 51 genes in the set are used to build the PCA matrix. Details on the methods can be found elsewhere [47]. In general, this limited set is able to discriminate between normal tissue and tumor in many tissues, with few misclassified samples in the confusion matrices. We show in Figure 4 an example of perfect discrimination in glioblastoma. This seemingly surprising result reinforces the idea that the baseline expression of these genes could play a role in late cancer.
We reported preliminary results for the FM profile of the Cuban population and indicated that it could be used to find risk markers for groups inside the population for diseases such as AD. The idea behind this statement is the following. First, people inside an FM group share a considerable amount of genetic background. Second, multifactorial, late diseases such as AD develop against the genetic background of each individual; thus, the background itself may be considered a risk factor. By comparing the density of AD patients with the population density one may, in principle, determine whether there are FM groups with a predisposition to AD. As FM groups are based on osseous structure, they could be used as early markers for younger groups with the same characteristics, allowing more detailed studies inside the group and early diagnosis. Early markers for other diseases like Parkinson’s and certain cancers could be obtained in the same way. A direct comparison between AD and population data is required in order to validate the statement.
The authors are grateful to Yasser Perera and Gabriel Gil for useful discussions. The authors acknowledge the Cuban Agency for Nuclear Energy and Advanced Technologies (AENTA) and the Office of External Activities of the Abdus Salam Centre for Theoretical Physics (ICTP) for support.
Funding: The work was self-funded by the authors.
Roberto Herrero: Data curation and processing, data visualization, formal analysis, investigation, methodology, and validation.
Yoanna Martinez: Data curation and processing, investigation, methodology, and validation.
Heydi Mendez: Data curation and processing, supervision, writing, review, and editing.
Joan Nieves: Data processing, data visualization, and formal analysis.
Augusto Gonzalez: Conceptualization, formal analysis, methodology, supervision, writing original draft, review, and editing.