Year: 2022 | Volume: 36 | Issue: 1 | Page: 32-38
Performances of depression detection through deep learning-based natural language processing to Mandarin Chinese medical records: Comparison between civilian and military populations
Tai-Yu Chen, M.D., Hsuan-Te Chu, M.D., Yueh-Ming Tai, M.D., Szu-Nian Yang, M.D., Ph.D.
Department of Psychiatry, Beitou Branch, Tri-Service General Hospital, National Defense Medical Center; Military Suicide Prevention Center, Taipei, Taiwan
Date of Submission: 19-Nov-2021
Date of Decision: 12-Dec-2021
Date of Acceptance: 14-Dec-2021
Date of Web Publication: 26-Mar-2022
Yueh-Ming Tai
No. 60, Shin-Ming Road, Beitou District, Taipei 112
Source of Support: None, Conflict of Interest: None
Objective: A certain portion of patients with depression are under-diagnosed, a problem that has attracted attention in the field of natural language processing (NLP). In this study, we explored the feasibility of turning unstructured textual records into a screening tool for the early detection of depression. Methods: We collected 22,355 medical records written in Mandarin traditional Chinese from the psychiatric emergency department of a military psychiatry center from 2004 to 2019. We preprocessed the full text of each present illness history as the corpus and the presence of a clinical diagnosis of depression as the outcome. A state-of-the-art NLP model was developed based on a pretrained bidirectional encoder representations from transformers (BERT) model combined with several convolutional neural network (CNN) layers, and trained independently on the training sets (80% of the original data) of the total samples (BERTgeneral), the civilian samples (BERTcivilian), and the military samples (BERTmilitary). The receiver operating characteristic (ROC) curves and areas under the curve (AUC) of the three trained models were compared for predicting depression on the test sets (20% of the original data) of the general and specific samples. Results: The experimental results demonstrated excellent performance of BERTgeneral for the general samples (AUC = 0.93, sensitivity = 0.817, specificity = 0.920 at the optimal cut-off point) and the civilian samples (AUC = 0.91, sensitivity = 0.851, specificity = 0.851 at the optimal cut-off point). BERTgeneral significantly underperformed for the military samples (AUC = 0.79, sensitivity = 0.712, specificity = 0.732, p < 0.05 at the optimal cut-off point). The performance of BERTmilitary for the military samples was slightly higher (AUC = 0.82, sensitivity = 0.708, specificity = 0.786 at the optimal cut-off point). Conclusion: This study showed the feasibility of applying deep learning techniques as a depression-detection assistant tool on Mandarin Chinese medical records.
However, the subjects' specific situations, e.g., military status, warrant further investigation.
Keywords: area under curve (AUC), artificial intelligence, bidirectional encoder representations from transformers, receiver operating characteristic (ROC)
|How to cite this article:|
Chen TY, Chu HT, Tai YM, Yang SN. Performances of depression detection through deep learning-based natural language processing to Mandarin Chinese medical records: Comparison between civilian and military populations. Taiwan J Psychiatry 2022;36:32-8
|How to cite this URL:|
Chen TY, Chu HT, Tai YM, Yang SN. Performances of depression detection through deep learning-based natural language processing to Mandarin Chinese medical records: Comparison between civilian and military populations. Taiwan J Psychiatry [serial online] 2022 [cited 2022 Aug 18];36:32-8. Available from: http://www.e-tjp.org/text.asp?2022/36/1/32/341041
Introduction
Worldwide, depression affects more than 260 million people. Its impairments include impaired functioning, higher suicide risk, and psychiatric and physiologic comorbidities. Furthermore, many individuals with depression do not receive an adequate diagnosis and treatment. Based on meta-analysis data, only about 50% of patients with depression in primary care have been diagnosed, and only 15% of them received treatment.
Depression is also a growing problem in the military. In the U.S. military, the prevalence of depression has increased in all services since 2007. A survey of Canadian military personnel found that only about three-quarters of those with depression in the previous year had sought mental health care, and that 70% of them had been seen by a psychologist or a psychiatrist.
At present, state-of-the-art artificial intelligence techniques have succeeded in many medical fields, e.g., determining Alzheimer's disease through neuroimaging data, identifying attention-deficit/hyperactivity disorder through fMRI images, and detecting depression through electroencephalographic information. In one study of postpartum depression-detection models built on clinical observation measurements from a cohort of 508 women during pregnancy, the best model achieved an area under the curve (AUC) of 0.78. In the field of language understanding through machine learning, natural language processing (NLP) is applied to unstructured texts. For example, Eichstaedt et al. reported a depression-detection model trained on the Facebook postings of 683 individuals with a diagnosis of depression, yielding an AUC of 0.72. An NLP model detecting first-time suicide attempts achieved an AUC of 0.932 using 45,000 medical records in English. Su et al. reported a depression-detection model using long short-term memory, trained on 1,538 elderly patients' medical records in simplified Chinese, with an AUC of 0.629.
At present, NLP has been shifting toward pretrained deep learning-based language representation models such as bidirectional encoder representations from transformers (BERT). The BERT model has outperformed other NLP models in tasks such as screening psychiatric patients using clinical notes, identifying antidepressant treatment response, and classifying clinical free text. In Taiwan, Dai et al. developed a series of deep learning-based NLP models for screening five psychiatric illnesses using discharge summaries in English, showing the excellent performance of BERT-based convolutional neural network (CNN) models for the detection of depression. Accordingly, a depression-detection NLP model based on BERT and CNN should also be feasible for medical textual data in Mandarin or traditional Chinese.
Regarding relevant studies in the military, one study used a statistical model based on a questionnaire, the Center for Epidemiologic Studies-Depression Scale (CES-D), for depression screening in U.S. military service members with mild traumatic brain injury and achieved an AUC of 0.897. In another study, investigators used machine learning algorithms to detect self-harm and suicidal ideation using 102 questionnaire variables from 738 veterans and soldiers, achieving an AUC of 0.827. To our knowledge, NLP models for military populations are still lacking. In this study, we intended to explore the feasibility of deep learning-based NLP models for detecting depression in Mandarin Chinese medical records. Considering the differences in depression between civilian and military populations, we also compared the performance of general and population-specific models between civilian and military samples.
Methods
The institutional review board of the Tri-Service General Hospital, National Defense Medical Center in Taipei, Taiwan, approved this study (protocol number: TSGHIRB No. B202005103; date of approval: June 4, 2021) without the need to obtain informed consent from the study participants whose electronic medical records were used.
The corpus of this study contains 22,355 electronic medical records, in Mandarin Chinese, of visits to the psychiatric emergency department of one military psychiatry hospital in northern Taiwan from 2004 to 2019. After excluding records with uncertain diagnoses or missing information, we divided participants into civilian and military samples according to their status at the time of their emergency department visit. Although the medical records were organized into several sections, e.g., mental status examination, laboratory data, etc., only the texts in the “present illness history” section and the “diagnosis” section were extracted for further analyses. All diagnoses had been confirmed by at least one board-certified psychiatrist during the period of management or treatment in the emergency department. We encoded all depression-related diagnoses, namely major depressive disorder, dysthymic disorder, or double depression, as a positive outcome. All other personal information of the patients was de-identified.
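The outcome-encoding step described above can be sketched as follows. This is a hypothetical illustration, not the authors' actual code: the record structure and diagnosis strings are assumptions, and the real records were in Mandarin Chinese.

```python
# A record is labeled positive (1) if its diagnosis text mentions any of the
# depression-related diagnoses named in the paper; diagnosis strings here are
# illustrative English stand-ins for the original Mandarin entries.
DEPRESSION_DIAGNOSES = {
    "major depressive disorder",
    "dysthymic disorder",
    "double depression",
}

def encode_outcome(diagnosis_text: str) -> int:
    """Return 1 if any depression-related diagnosis appears, else 0."""
    text = diagnosis_text.lower()
    return int(any(dx in text for dx in DEPRESSION_DIAGNOSES))

# Toy records with the two sections the study extracted.
records = [
    {"history": "...", "diagnosis": "Major Depressive Disorder, recurrent"},
    {"history": "...", "diagnosis": "Schizophrenia"},
]
labels = [encode_outcome(r["diagnosis"]) for r in records]
```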
Using the Python programming language (version 3.8) and related tools, the corpus (the unstructured texts of the present illness histories) and the targets (the depression outcomes) were first divided into civilian and military groups. Next, each group was randomly divided into training and test sets in a ratio of 80:20.
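A minimal sketch of the 80:20 random split, using only the standard library; the seed and group sizes are illustrative, not taken from the study.

```python
import random

def split_80_20(items, seed=42):
    """Shuffle records reproducibly and split them 80% train / 20% test."""
    items = list(items)
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]

# e.g., split one group's record indices
civilian_train, civilian_test = split_80_20(range(1000))
```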
All of the corpus, unstructured texts in Mandarin Chinese with a small portion of English terms, was processed by the following steps ([Figure 1] part a): (a) converting characters into tokens, (b) assigning each token a numeric id, (c) trimming sequences of ids to a fixed length (512 in this study), and (d) padding with a special token those sequences shorter than 512.
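Steps (a)-(d) can be sketched as below. This toy version builds its own character vocabulary; the actual study used the vocabulary shipped with the pretrained Chinese BERT model, and the PAD/UNK ids here are assumptions.

```python
MAX_LEN = 512       # fixed sequence length used in the study
PAD_ID, UNK_ID = 0, 1  # illustrative special-token ids

def build_vocab(corpus):
    """Assign each distinct character a numeric id, reserving 0/1 for PAD/UNK."""
    chars = sorted({ch for text in corpus for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}

def encode(text, vocab):
    ids = [vocab.get(ch, UNK_ID) for ch in text]    # (a)+(b) characters -> ids
    ids = ids[:MAX_LEN]                             # (c) trim to 512
    return ids + [PAD_ID] * (MAX_LEN - len(ids))    # (d) pad shorter sequences

vocab = build_vocab(["憂鬱症病史"])
sequence = encode("憂鬱", vocab)   # always length 512
```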
|Figure 1: The concept diagrams of data preprocessing (a) and deep learning model establishment (b). The embedding layer is a pretrained TensorFlow Bidirectional Encoder Representations from Transformers model, bert_zh_L-12_H-768_A-12_2, downloaded from the TensorFlow Hub (www.tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/2); Conv, Convolutional layer|
Inspired by Dai et al., a series of pretrained BERT models combined with text-CNN models ([Figure 1] part b) was implemented in this study because of their outstanding performance. The final output of each model was the probability of a depression diagnosis produced by a sigmoid function.
Baseline model: Bidirectional encoder representation from transformers-based model fine-tuning
BERT is a pretrained deep learning model that learns bidirectional representations from large unlabeled texts by jointly conditioning on both left and right context. The pretrained model, bert_zh_L-12_H-768_A-12_2, was downloaded from TensorFlow Hub (www.tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/2); it was developed by the Google Brain Team specifically for Chinese-language data. The BERT models in this study were fine-tuned using the sequences of 512 tokens mentioned above as the input dataset (the embedding layer in part b of [Figure 1]).
Textual convolutional neural network
CNNs have lately flourished and succeeded in image classification and recognition tasks. The text-CNN model in this study applies three convolutions to the represented sequences from the BERT model mentioned above. Each convolution can be considered a feature extractor that uses a kernel, or window, to capture implicit linguistic properties embedded in the input texts. In our implementation, context windows with lengths of 2, 3, and 4 were included to capture bigram, trigram, and four-gram features, respectively. Max-pooling layers follow the convolutional layers to obtain the maximum activations ([Figure 1] part b). The combination of these layers enables us to capture the most prominent features for determining the diagnosis of depression. According to the training sets used for model training, namely the total samples, the civilian samples, and the military samples, we developed three independent models, named BERTgeneral, BERTcivilian, and BERTmilitary, respectively.
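The core idea of one text-CNN branch (an n-gram window convolution followed by max-pooling over time) can be illustrated in plain NumPy. This is a conceptual sketch, not the authors' TensorFlow implementation; the filter count (32 per branch) and random inputs are assumptions, while the window sizes (2, 3, 4) and the 512x768 BERT-sized input follow the text.

```python
import numpy as np

def ngram_conv_max_pool(seq_emb, kernel):
    """One text-CNN branch: slide an n-gram window over token embeddings,
    apply a ReLU convolution, then max-pool over time.

    seq_emb: (T, d) token embeddings (e.g., BERT outputs)
    kernel:  (n, d, f) filter bank with window size n and f filters
    returns: (f,) maximum activation of each filter over all positions
    """
    n, T = kernel.shape[0], seq_emb.shape[0]
    feats = np.stack([
        np.maximum(0.0, np.tensordot(seq_emb[t:t + n], kernel,
                                     axes=([0, 1], [0, 1])))
        for t in range(T - n + 1)
    ])                          # (T-n+1, f): one activation vector per window
    return feats.max(axis=0)    # max-pooling over time

rng = np.random.default_rng(0)
emb = rng.normal(size=(512, 768))            # a 512-token, BERT-sized sequence
pooled = [ngram_conv_max_pool(emb, rng.normal(size=(n, 768, 32)))
          for n in (2, 3, 4)]                # bigram/trigram/four-gram branches
features = np.concatenate(pooled)            # combined feature vector for the classifier
```

The concatenated `features` vector is what a final sigmoid classifier layer would consume to output the depression probability.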
We used descriptive statistics and compared demographic characteristics between civilian and military samples with the t-test for continuous variables and the Chi-squared test for dichotomous or categorical ones. To present the diagnosis-detection performance of the machine learning models, we calculated the optimal cut-off points of the receiver operating characteristic (ROC) curves according to the maximal sum of sensitivity and specificity of the predicted probabilities on the test sets. In addition, the magnitudes of their AUCs were calculated with the R statistical language. We also used the DeLong test to compare the AUCs of two models.
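The optimal cut-off criterion above (maximal sensitivity + specificity, i.e., the Youden index) can be sketched as follows. The study computed this in R; this NumPy version is an illustrative equivalent, not the authors' code, and it omits the DeLong comparison.

```python
import numpy as np

def roc_auc_and_optimal_cutoff(y_true, y_score):
    """Compute the ROC curve, its AUC (trapezoidal rule), and the cut-off
    that maximizes sensitivity + specificity (the Youden index)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.unique(y_score)[::-1]          # descending candidate cut-offs
    P, N = (y_true == 1).sum(), (y_true == 0).sum()
    tpr, fpr = [], []
    for th in thresholds:
        pred = y_score >= th
        tpr.append((pred & (y_true == 1)).sum() / P)   # sensitivity
        fpr.append((pred & (y_true == 0)).sum() / N)   # 1 - specificity
    tpr = np.array([0.0, *tpr, 1.0])
    fpr = np.array([0.0, *fpr, 1.0])
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
    youden = tpr - fpr                      # sensitivity + specificity - 1
    best = float(thresholds[np.argmax(youden[1:-1])])
    return auc, best
```

For a perfectly separable toy example, `roc_auc_and_optimal_cutoff([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])` yields an AUC of 1.0 with an optimal cut-off of 0.8.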
All of the above variables were computed using the Statistical Package for the Social Sciences version 25 for Windows (SPSS Inc., Chicago, Illinois, USA). Differences between groups were considered significant if p-values were smaller than 0.05.
Results
In this study, a total of 22,355 unstructured texts from present psychiatric illness histories in Mandarin Chinese were collected from a military psychiatry hospital from 2004 to 2019. [Table 1] depicts the demographic characteristics and subjects' diagnoses relevant to depression, as well as the characteristic differences between the military and civilian groups. [Table 2] presents the comparisons between the training and test sets in both groups.
|Table 1: Baseline demographic characteristics of military and civilian samples|
|Table 2: The comparisons of the training and test sets of the military and civilian samples. The two sets were randomly divided in a ratio of 80:20|
[Figure 2] illustrates the performance of the BERTgeneral model in determining the depression diagnoses of the test sets of the total samples (the gray line), civilian samples (the dotted line), and military samples (the black line). [Figure 3] illustrates the performances of the BERTgeneral model (the gray line) and the BERTcivilian model (the black line) in determining the depression diagnoses of the civilian samples. [Figure 4] illustrates the performances of the BERTgeneral model (the gray line) and the BERTmilitary model (the black line) in determining the depression diagnoses of the military samples.
|Figure 2: The receiver operating characteristic curves and optimal cut-off points (and sensitivity/specificity) of the BERTgeneral model, which was trained on the total training samples, in predicting the depression diagnoses|
|Figure 3: Based on civilian samples only, the receiver operating characteristic curves and optimal cut-off points (and sensitivity/specificity) of the BERTgeneral model, trained on the total training samples (the gray line), and the BERTcivilian model, trained on the civilian training samples (the black line), in predicting the depression diagnoses|
|Figure 4: Based on military samples only, the receiver operating characteristic curves and optimal cut-off points (and sensitivity/specificity) of the BERTgeneral model, trained on the total training samples (the gray line), and the BERTmilitary model, trained on the military training samples (the black line), in predicting the depression diagnoses|
Discussion
In line with the literature, this study demonstrates the excellent performance of artificial intelligence-driven language understanding techniques in detecting depression diagnoses by learning from unstructured texts in Mandarin Chinese medical records. The combination of two state-of-the-art machine learning models, BERT and CNN, presented an acceptable performance, with a sensitivity of 0.817 ± 0.003 and a specificity of 0.920 ± 0.002 at the optimal cut-off point (0.17) on the ROC curve when tested on the general test set (AUC = 0.93). These findings outperform Su et al.'s depression-prediction model for the elderly in China (AUC = 0.63) and are close to Tsui's first-time suicide attempt-prediction model for English medical records (AUC = 0.932).
Given the special stressful circumstances preceding suicidality and depression in military populations, discrepancies existed in demographic characteristics, e.g., age and sex [Table 1], between the civilian and military samples. We found a significant difference in the depression-detection performance of the BERTgeneral model between the civilian samples (AUC = 0.91, [Figure 3]) and the military samples (AUC = 0.79; DeLong test, D = 2.04, p < 0.05). Compared with Kennedy et al.'s model (AUC = 0.897) for the CES-D scale and Colic et al.'s model (AUC = 0.827) for 102 variables from military populations, our BERTgeneral model underperformed. A specific BERTmilitary model trained solely on the military training set showed better performance (AUC = 0.82) but did not differ significantly from the BERTgeneral model (DeLong test, D = 1.59, no significance, [Figure 4]).
On the other hand, a civilian-specific BERTcivilian model trained solely on civilian samples did not perform better when tested on the civilian test set (AUC = 0.90; DeLong test, D = 0.004, no significance, [Figure 3]).
The clinical implication of our findings is to encourage future depression-diagnosis assisting tools for situations where the diagnosis of depression is suspected but experts are unavailable, or where patients with depression remain under-diagnosed, for example, civilians in the community or active duty service members in the military. Practically, for possible depression diagnoses, our model predicts well in the community population (sensitivity = 85%, specificity = 85%) but shows unsatisfactory performance for the military population in Taiwan (sensitivity = 71%, specificity = 78%). Therefore, its application still needs further exploration.
Readers are cautioned not to over-interpret the study results because this study has six limitations:
- The issue of most concern is the length limitation of the texts used in the machine learning models. All content of the present illness history beyond the 512-token limit was not used; consequently, two histories that differ only after the first 512 tokens would yield the same result from our models.
- Another critical drawback of our study is the lack of a noise-suppression or noise-cleaning mechanism in the model architecture or training. Our models cannot recognize texts unrelated to the target illness, depression in this study, e.g., unrelated texts from family members or accompanying persons.
- The superior performance of our models might be partially due to the larger training set than in other studies, and our models did not include some important demographic information and/or medication treatments that might be essential for depression detection.
- The relatively unsatisfactory performance of our model on the military samples might indicate some inevitable limitations of our data. In general, the prevalence of depression is higher in females than in males. This phenomenon is not present in our military samples, probably due to the imbalanced gender ratio in the Taiwanese army and some specific characteristics that distinguish the female military population from female community samples.
- Civilian individuals had 1.85 medical records per person over those 16 years, obviously higher than the military samples (1.22 medical records per person, [Table 1]).
- This discrepancy reminds us that the limited active service duration of soldiers can be another chronological influence on our analyses. Furthermore, complex deep learning models with more than one input are warranted to enhance depression-detection performance, especially for the military population.
Based on the unstructured texts of the present illness histories in psychiatric medical records in Mandarin Chinese, we developed artificial intelligence-based language understanding models to detect subjects' diagnoses relevant to depression. The results showed that our models performed well on civilian samples but underperformed on military samples, even when retrained solely on the military training set. Nevertheless, this study reconfirmed the feasibility of machine learning-based depression diagnosis-assisting tools in the clinic. Our findings also remind clinicians to be aware of the underperformance of AI-assisted diagnoses in subjects under military stressful circumstances.
Acknowledgments
The authors thank all study staff for their contributions, especially Chung-Yuan Cheng, and the study patients for their psychiatric medical records from 2004 to 2019.
All opinions expressed in this article are the authors' own. They do not necessarily reflect those of their hospital or institution. The datasets analyzed in this article are not publicly available. Any special requests to access the datasets should be addressed to YM Tai, the corresponding author of this article.
Financial Support and Sponsorship
This study received the financial support of the Military Suicide Prevention Center (MSPC) of Taiwan (TSGH-BT_D_110008).
Conflicts of Interest
Yueh-Ming Tai, an executive editorial board member of the Taiwan Journal of Psychiatry, had no role in the peer review process of or decision to publish this article. The other authors declare no conflicts of interest in writing this paper.
References
World Health Organization: Fact Sheet on Depression 2018. Geneva, Switzerland: World Health Organization, 2018.
Mitchell AJ, Vaze A, Rao S: Clinical diagnosis of depression in primary care: a meta-analysis. Lancet 2009; 374: 609-19.
Greenberg J, Tesfazion AA, Robinson CS: Screening, diagnosis, and treatment of depression. Mil Med 2012; 177: 60-6.
Thériault FL, Garber BG, Momoli F, et al.: Mental health service utilization in depressed Canadian armed forces personnel. Can J Psychiatry 2019; 64: 59-67.
Ahmad I, Pothuganti K: Analysis of different convolution neural network models to diagnose Alzheimer's disease. Mater Today Proc 2020; 37: 2800-12.
Ariyarathne G, De Silva S, Dayarathna S, et al.: ADHD identification using convolutional neural network with seed-based approach for fMRI data. In: Proceedings of the 2020 9th International Conference on Software and Computer Applications. North Chicago, Illinois, USA, 2020.
Seal A, Bajpai R, Agnihotri J, et al.: DeprNet: a deep convolution neural network framework for detecting depression using EEG. IEEE Trans Instrum Meas 2021; 70: 1-13.
Zhang W, Liu H, Silenzio VM, et al.: Machine learning models for the prediction of postpartum depression: application and comparison based on a cohort study. JMIR Med Inform 2020; 8: e15516.
Eichstaedt JC, Smith RJ, Merchant RM, et al.: Facebook language predicts depression in medical records. Proc Natl Acad Sci 2018; 115: 11203-8.
Tsui FR, Shi L, Ruiz V, et al.: Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts. JAMIA Open 2021; 4: ooab011.
Su D, Zhang X, He K, et al.: Use of machine learning approach to predict depression in the elderly in China: a longitudinal study. J Affect Disord 2021; 282: 289-98.
Devlin J, Chang MW, Lee K, Toutanova K: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv 2018; 1810.04805v2 [cs.CL].
Peng Y, Yan S, Lu Z: Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv 2019; 1906.05474v2 [cs.CL].
Dai HJ, Su CH, Lee YQ, et al.: Deep learning-based natural language processing for screening psychiatric patients. Front Psychiatry 2020; 11: 533949.
Sheu YH, Magdamo C, Miller M, Das S, Blacker D, Smoller JW: Phenotyping antidepressant treatment response with deep learning in electronic health records. medRxiv
Tang M, Gandhi P, Kabir MA, et al.: Progress notes classification and keyword extraction using attention-based deep learning models with BERT. arXiv 2019; 1910.05786v2 [cs.CL].
Kennedy JE, Reid MW, Lu LH, et al.: Validity of the CES-D for depression screening in military service members with a history of mild traumatic brain injury. Brain Inj 2019; 33: 932-40.
Colic S, He JC, Richardson JD, et al.: A machine learning approach to identification of self-harm and suicidal ideation among military and police veterans. J Mil Veteran Fam Health 2021; 8: e20210035-43.
Van Rossum G, Drake FL: Python Reference Manual. Scotts Valley, California, USA: iUniverse Indiana, 2000.
Vaswani A, Shazeer N, Parmar N, et al.: Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 5998-6008.
R Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2013.
Skopp NA, Holland KM, Logan JE, et al.: Circumstances preceding suicide in US soldiers: a qualitative analysis of narrative data. Psychol Serv 2019; 16: 302-10.