
Automated AI approaches to clinical coding “A Case Study”


Clinical coding is an administrative process that translates the diagnostic data from an episode of care into a standard code format. The clinical data sources include (but are not limited to):

  • Admission data
  • Discharge summaries
  • Pathology tests
  • Radiology tests
  • Pharmacy orders

In Figure 1, we show a simple graphical representation of how the coding process takes place. ICD-10 is the most widely adopted coding standard worldwide. The abbreviation stands for the 10th revision of the International Statistical Classification of Diseases and Related Health Problems, which is compiled by the World Health Organisation. It provides codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or disease.

Generally, each code has up to 7 characters, where characters 1-3 define the category of the disease, characters 4-6 define the body site and severity, and the last character is an extension. Nevertheless, different countries may apply some variations to the code structure to suit their specific needs. In the near future, coding will become increasingly cumbersome due to the move towards the eleventh revision of the codes, “ICD-11”. It is expected to exceed 55,000 codes, with some new changes to Emergency and Mental Health documentation. Moreover, it is critical for States and Territories (and Health services) to understand the implications of the classification early, so that they can influence its direction and, ultimately, secure adequate funding contributions.
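
To make the character layout above concrete, here is a minimal sketch that splits a code string into the three parts described. The function and class names are hypothetical, the example code strings are only illustrative, and real national variants may deviate from this layout.

```python
from typing import NamedTuple


class CodeParts(NamedTuple):
    category: str   # characters 1-3
    detail: str     # characters 4-6 (body site and severity), may be shorter
    extension: str  # character 7, empty when absent


def split_code(code: str) -> CodeParts:
    """Split a code into the three parts described in the text.

    The dot after the category is a display convention and is ignored.
    """
    chars = code.replace(".", "").strip().upper()
    if not 3 <= len(chars) <= 7:
        raise ValueError(f"expected 3 to 7 characters, got {len(chars)}: {code!r}")
    return CodeParts(chars[:3], chars[3:6], chars[6:7])


parts = split_code("S52.521A")  # illustrative code string
short = split_code("K35")       # a category-only code has no detail or extension
```

Shorter codes simply leave the later parts empty, which keeps downstream logic (for example grouping by category) uniform.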

Figure 1 Clinical coding process

The two main use cases for clinical coding are:

  • Billing (Local & National Governments, Health Care & Insurance)
  • Reporting (Epidemiology research, Public policy, Epidemic Surveillance)

Clinical Coders thoroughly examine all medical records for an episode of care to identify:

  • The primary diagnosis
  • Any secondary treatments
  • Co-morbidities
  • Complications

All of these are mapped to the most specific ICD-10 codes.

The process is still done manually using books and coding manuals in most cases.

What seems to be the problem then?

The manual clinical coding process creates a number of difficulties for the organizations that rely on the codes. Here are some of these issues:

  1. Demanding Job: Clinical coding is a challenging job that requires extensive knowledge of:
    • Medical & Anatomical terminology
    • Health data standards & Classification conventions
    • Health information systems
  2. Experience: It takes around 4 years of experience on average for coders to become proficient and essential to the team
  3. Throughput: Frequent & simple cases can be coded at an average of 24 cases per 8h workday (about 20 minutes per case) [1], yet complex cases can take up to a few weeks
  4. Accuracy: Average accuracy levels hover around 70%-75% [2]. In Table 1, we show some statistics from a clinical coding contest run by the American Health Information Management Association (AHIMA). They give a sense of how complex the process is even for experienced coders.
  5. Speed vs Accuracy Dilemma: The faster coders work and the more cases they code, the lower the quality. On the other hand, the more time spent on each case, the better the quality but the lower the overall productivity.
  6. Staff Shortage: Only about 52% of clinical coders work full time [3], and there are only around 500 roles in Victoria [4]. As a result, some facilities have taken measures such as off-shoring to address the shortage and the growing backlog
Table 1: Central learning 2nd national ICD-10 coding contest

What are the impacts of coding errors?

Each information source and each case has the potential for coding and classification errors. Coding errors can have severe impacts in many areas, such as billing mistakes that cause underpayments [2,5,6]. Consider appendectomy, the most common emergency surgery performed in Australia [7]: incomplete coding can have a substantial funding impact.

Example: A patient is admitted with acute appendicitis. During the post-operative period, the patient develops a wound infection and is treated with IV antibiotics.

Table 2 Billing impact of miscoding acute suppurative appendicitis

From the example, we can easily see how a coding error can result in rework and/or reduced funding. Another severe impact of miscoding is losing track of trending epidemics in surveillance [8-9].

What if we could use AI to undertake clinical coding?

If AI can drive a car as well as a human, could it code as well as a human?

There have been major breakthroughs in artificial intelligence (AI) based approaches in the last few years.

AI is the broad science of machines mimicking human abilities. It allows computers to learn from data and experience without someone explicitly programming them to perform the requested tasks. Machines can process far more data than humans and learn deeper latent relations that are hard for humans to pick up, and can eventually reach higher levels of accuracy. This allows for reliable and repeatable results and consistently confident decisions. Despite the many obstacles in the healthcare domain, AI can play a key role in the clinical coding arena and can offer many tangible benefits:


  • Cost savings
  • Improved consistency
  • Overcome the problem of under-staffing
  • Provisional coding becomes feasible
  • Faster coding means faster payments (releasing capital)
  • Auditing becomes less cumbersome, more accurate and wider in coverage

General complexities of medical data

Many healthcare organizations do not impose strong, conceptually structured processes to manage data quality, especially for long-term use. Clean health records and downstream datasets add increasing value over time. Even with the introduction of electronic medical records (EMRs), few currently operating systems provide easy, real-time processing of the available data. Below is a summarized list of complexities surrounding medical data:

  • Varied levels of EMR data quality
  • Low interoperability and complexity in clinical systems
  • Data collection, retrieval & wrangling is challenging
  • Dealing with missing & incomplete data
  • Data sampling & coverage (real case-mix)
  • Regulatory requirements/bureaucracy

Maharaj Nakorn Chiang Mai Hospital Case Study

Maharaj Nakorn Chiang Mai Hospital is a university teaching hospital, affiliated to the Faculty of Medicine of Chiang Mai University, located in Mueang Chiang Mai District, Chiang Mai Province. Established in 1941, it was the first Thai hospital outside of Bangkok. It is a fairly large hospital with 1,400 patient beds, 69 ICU beds, 92 sub-ICU beds, and 28 operating rooms. It treats 45,000 inpatient cases on average per year, including over 1,000 open heart surgeries and 40 kidney transplants. The hospital registers almost 1.3M patients in its outpatient clinics. One of the most interesting facts that motivated us to work on this case study is that the Diagnosis Related Group (DRG) system adopted by the hospital is very similar to the AR-DRG classification system. This means that we can apply our understanding and analysis to the Australian healthcare domain.

Data Complexity

Our data was collected from the Chiang Mai Hospital’s repositories for episodes of care recorded between 2006 and 2019. Table 3 highlights some of the important statistics to give an overview of the data complexity.

Table 3 Maharaj Nakorn Chiang Mai Hospital dataset statistics

As this article is intended to introduce the data and solutions at a high level, we present only the most important highlights; further details will appear in more technical articles to follow. Here are some of the interesting highlights of the data:

  • 42.5% of episodes of care have a unique code set (very few examples per label)
  • Inpatient episodes of care are dramatically more complex
  • Outpatient follow-up cases (where we lack the patient history) are quite challenging
  • Long tail of complexity: 100 codes appear in 70% of the cases, as shown in Figure 2

Figure 2 The percentage of occurrence of the 30 most frequent ICD-10 codes in the inpatient dataset

In Figure 2, we show a graphical representation of the long-tail problem using the 30 most frequent ICD-10 codes in the inpatient dataset. We can see that the majority of codes appear only rarely. This makes machine learning considerably harder: models assign low probabilities to infrequent codes, which can easily lead to biased models.
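
The two long-tail statistics quoted above (the share of episodes with a unique code set, and how many cases the most frequent codes cover) can be computed along the following lines. This is a sketch on made-up toy data; the function names are hypothetical, and reading "codes appear in X% of the cases" as "the episode contains at least one of the top-k codes" is an assumption.

```python
from collections import Counter


def unique_code_set_share(episodes):
    """Fraction of episodes whose exact code set occurs only once."""
    counts = Counter(frozenset(codes) for codes in episodes)
    unique = sum(1 for codes in episodes if counts[frozenset(codes)] == 1)
    return unique / len(episodes)


def top_k_coverage(episodes, k):
    """Fraction of episodes containing at least one of the k most frequent codes."""
    freq = Counter(code for codes in episodes for code in set(codes))
    top = {code for code, _ in freq.most_common(k)}
    return sum(1 for codes in episodes if top & set(codes)) / len(episodes)


# Toy episodes of care, each a list of ICD-10 codes (made-up data).
episodes = [
    ["I10", "E11"],
    ["I10"],
    ["I10", "E11"],
    ["K35"],
    ["J18", "I10", "N18"],
]

unique_share = unique_code_set_share(episodes)  # 3 of 5 code sets occur once
coverage_top1 = top_k_coverage(episodes, k=1)   # 4 of 5 episodes contain "I10"
```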

How to handle each data source?

Each data source comes with a different format, type and complexity. Pre-processing is therefore a major challenge, and it is the most critical step in providing meaningful predictive signals. The processing and modelling phase has to deal with another set of harder challenges, as we will show later.

Table 4: Data sources details & challenges

Extensive preprocessing was performed on the different data sources. For instance, we deal with unstructured text (e.g. radiology reports), semi-structured laboratory data (different formats including text, numeric and mixed values), sparse structured drug orders and tabular admission data.
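
As an illustration of what handling semi-structured laboratory values can involve, the sketch below normalizes a raw result into a numeric part and a text token. It is not the preprocessing used in this study; the function name, the regular expression and the sample inputs are all assumptions.

```python
import re
from typing import Optional, Tuple

# Optional comparator ("<", ">", "<=", ">=") followed by a plain decimal number.
_NUMBER = re.compile(r"^\s*([<>]=?)?\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_lab_value(raw: str) -> Tuple[Optional[float], str]:
    """Return (numeric_value, token) for one raw laboratory result.

    Numeric results such as "12.5" or "< 0.5" give a float plus a comparator
    token ("=" when none is written); anything else is kept as a lower-cased
    text token with no numeric value.
    """
    match = _NUMBER.match(raw)
    if match:
        comparator = match.group(1) or "="
        return float(match.group(2)), comparator
    return None, raw.strip().lower()


examples = [parse_lab_value(v) for v in ["12.5", "< 0.5", " Positive "]]
```

Keeping the comparator as a separate token preserves the difference between a measured value and a detection limit, which would otherwise be lost when everything is cast to a number.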

Challenges of Automation

In light of the data complexities shown above for our case study, many challenges arise when automating clinical coding. Below is a summarized list of these challenges:

  • Extreme Multi-label Classification (12,000 unique labels)
  • Lack of accurate ground truth coding (gold standard)
  • Scarcity of publicly available datasets (few exist, e.g. MIMIC with only 50,000 ICU patients)
  • Dealing with imbalanced data (many complex and rare cases -> real-life case-mix)
  • How to best combine knowledge from multiple data sources (feature fusion)

Suitability of deep learning (AI)

In our quest for the best modelling approach to automate clinical coding, we chose to adopt deep learning based techniques. Deep learning is a family of machine learning methods based on neural networks with very high representation learning capabilities. It is loosely inspired by how the human brain solves problems, passing queries through hierarchies of concepts and related questions to find an answer. Lately, it has been adopted to solve many complex problems in areas such as Image Processing & Computer Vision, Natural Language Processing (NLP), Machine Translation, Self Driving Cars, Fraud Detection and many more. We can summarise the reasons for adopting deep learning as follows:

  • The inherent high complexity & non-linearity embedded in the clinical coding problem
  • It can automatically learn hidden intrinsic complex features that are hard to discover manually
  • It can easily crunch huge amounts of data distilling priceless knowledge
  • Current availability of adequate training infrastructures (on premises or on cloud)

Modelling Architectures

In this section, we give a brief overview of some of the design architectures we explored for building ICD-10 predictive models. First, we formulate the problem as a multi-label classification problem: predicting the ICD-10 codes for an episode of care. We adopt feed-forward neural network architectures [10] to predict a probability for each ICD-10 code. We then decide on the final set of predicted ICD-10 codes by selecting those with the highest probability values. The details of all models are expected to appear in our forthcoming academic papers.
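
The multi-label formulation can be sketched as a feed-forward pass that ends in one independent sigmoid per code, followed by a selection step. The layer sizes, the random untrained weights, the three-code label space and the fixed 0.5 threshold are illustrative assumptions, not the configuration used in this study.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def predict_probabilities(features, layer1, layer2):
    """Forward pass: ReLU hidden layer, then one sigmoid per label."""
    hidden = [max(0.0, h) for h in dense(features, *layer1)]
    return [sigmoid(z) for z in dense(hidden, *layer2)]


def select_codes(probabilities, labels, threshold=0.5):
    """Keep every label whose probability reaches the threshold."""
    return [lab for lab, p in zip(labels, probabilities) if p >= threshold]


def random_layer(n_out, n_in):
    """Untrained layer with random weights and zero biases (for illustration)."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


random.seed(0)
labels = ["I10", "E11", "K35"]  # toy label space
layer1 = random_layer(5, 4)     # 4 input features -> 5 hidden units
layer2 = random_layer(len(labels), 5)

probs = predict_probabilities([1.0, 0.0, 0.5, 0.2], layer1, layer2)
codes = select_codes(probs, labels)
```

Because each output has its own sigmoid, the probabilities do not have to sum to one, so an episode can receive several codes or none.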

One very intuitive modelling architecture is to stack all available evidence from the different data sources and train one network. This way, all of the interactions between the different pieces of evidence can be captured, along with their relation to the final diagnosis. We refer to this architecture as the combined model in the results section.

In Figure 3 we show the graphical structure of the combined model. Given that we have multiple data sources, we do not expect this architecture to be the best one. The varied complexity of the data sources results in an over-complicated network that needs multiple iterations of hyper-parameter fine-tuning and experimentation with various numbers of layers and loss functions. This complexity causes the model to saturate very early in the training process, which leads to under-exploration of the different data modalities.

Figure 3: Combined model structure

In our second modelling architecture, separate networks are trained on individual sources as shown in Figure 4. The predictions of each network are then aggregated using average or weighted average techniques. This way, representative or shallower data representations from the different data sources will not dominate the feature space in the training process. However, late fusion, where knowledge is combined only after a conclusion has been reached from each data source, was found to be less informative in making the final decision.
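
The averaging step can be sketched as follows, assuming each per-source network outputs one probability per label in the same label order. The source names, weights and probabilities below are made up for illustration.

```python
def average_predictions(per_source, weights=None):
    """Late fusion: (weighted) mean of per-source probability vectors.

    per_source maps a source name to a list of per-label probabilities.
    weights maps a source name to a non-negative weight; equal if omitted.
    """
    names = list(per_source)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_labels = len(per_source[names[0]])
    return [
        sum(weights[name] * per_source[name][i] for name in names) / total
        for i in range(n_labels)
    ]


per_source = {
    "pathology": [0.9, 0.1],
    "radiology": [0.5, 0.3],
    "pharmacy": [0.7, 0.8],
}
plain = average_predictions(per_source)  # about [0.7, 0.4]
weighted = average_predictions(
    per_source, {"pathology": 1, "radiology": 1, "pharmacy": 2}
)                                        # about [0.7, 0.5]
```

Note that the weights are fixed per source: the same weighting applies to every label and every episode, which is one way to see why this fusion is less expressive than a learned combination.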

Figure 4: Averaging model structure

This leads to our ensemble modelling architecture, shown in Figure 5. We need to balance our model design so that it fully explores the different data modalities, with their various levels of complexity, and accurately learns the interrelationships between them. We have built an expert network on top of the individually trained models, which we refer to as an “Ensemble or Expert network”. This network mimics how human clinical coders operate: they are presented with all sorts of clinical evidence and then decide how best to translate it into a final diagnosis.

In fact, rather than looking at individual raw data sources, the network learns from experts, represented by the trained networks, which we can think of as a hypothetical pathologist, radiologist and pharmacist. Over time, the ensemble network learns how much to rely on each expert when deciding on the diagnosis. Furthermore, it can formulate a new diagnosis based on the combination of predictions from the individual networks, rather than just taking the highest weighted prediction from one source.
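
A minimal sketch of this stacking idea: the per-source predictions are concatenated and fed to a small trainable layer that learns how to combine them. Here the expert is reduced to a single sigmoid layer trained by plain gradient descent on toy data; the real expert network, its depth and its training procedure are not described in this article, so everything below is an assumption.

```python
import math


def meta_features(per_source, order):
    """Concatenate the per-source probability vectors in a fixed order."""
    return [p for name in order for p in per_source[name]]


def expert_forward(x, weights, biases):
    """Single dense layer with sigmoid outputs, one per label."""
    return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
            for row, b in zip(weights, biases)]


def train_expert(samples, targets, n_labels, epochs=500, lr=0.5):
    """Fit the expert layer with gradient descent on the cross-entropy loss."""
    n_in = len(samples[0])
    weights = [[0.0] * n_in for _ in range(n_labels)]
    biases = [0.0] * n_labels
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            probs = expert_forward(x, weights, biases)
            for j in range(n_labels):
                grad = probs[j] - y[j]  # derivative of the loss w.r.t. the logit
                biases[j] -= lr * grad
                for i in range(n_in):
                    weights[j][i] -= lr * grad * x[i]
    return weights, biases


order = ["pathology", "pharmacy"]  # hypothetical source names
toy_predictions = [
    {"pathology": [0.9, 0.2], "pharmacy": [0.8, 0.1]},
    {"pathology": [0.2, 0.7], "pharmacy": [0.1, 0.9]},
    {"pathology": [0.6, 0.6], "pharmacy": [0.2, 0.9]},
    {"pathology": [0.7, 0.3], "pharmacy": [0.9, 0.2]},
]
toy_targets = [[1, 0], [0, 1], [0, 1], [1, 0]]  # made-up true labels

samples = [meta_features(p, order) for p in toy_predictions]
weights, biases = train_expert(samples, toy_targets, n_labels=2)
fused = [expert_forward(x, weights, biases) for x in samples]
```

Unlike fixed averaging, the learned weights can differ per label, so the expert can trust one source for some codes and another source for others.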

Figure 5: Our ensemble model structure

Preliminary Results

In this section, we present the evaluation measures used in empirically quantifying the accuracy of our models along with the experimental results.

Evaluation Measures

Unlike binary/multi-class classification, assessing the performance of multi-label classification is more about which combinations of labels are right. We need to inspect the results from many angles to identify different error cases, including under-coding and over-coding, and to understand how the model behaves in different cases. Thus, we adopt the following evaluation measures:

  • Average precision: Summarizes a precision-recall curve as the weighted mean of the precision achieved at each threshold
  • Coverage error: Computes how far we need to go through the ranked scores to cover all true labels
  • Ranking loss: Computes the average number of label pairs that are incorrectly ordered given the predicted scores, weighted by the size of the label set and the number of labels not in the label set
  • F1 score: Measures the harmonic mean of precision (the ability not to label a negative sample as positive) and recall (the ability to detect all positive samples)
  • Jaccard similarity: Measures the similarity between two label sets as the size of their intersection divided by the size of their union
  • Accuracy: Measures #correct/#samples; used only for the primary diagnosis
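
For a single episode of care, several of these measures can be written out directly. The sketch below gives per-episode versions in plain Python; in practice the values would be averaged over all episodes. The tie handling in the ranking loss (only strictly higher scores count as errors) is an assumption, and library implementations may treat ties differently.

```python
def set_metrics(true_codes, predicted_codes):
    """Precision, recall, F1 and Jaccard similarity for one episode."""
    true_codes, predicted_codes = set(true_codes), set(predicted_codes)
    hits = len(true_codes & predicted_codes)
    precision = hits / len(predicted_codes) if predicted_codes else 0.0
    recall = hits / len(true_codes) if true_codes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    union = len(true_codes | predicted_codes)
    jaccard = hits / union if union else 1.0
    return precision, recall, f1, jaccard


def coverage_error(true_idx, scores):
    """Number of top-ranked labels needed to include every true label."""
    lowest_true_score = min(scores[i] for i in true_idx)
    return sum(1 for s in scores if s >= lowest_true_score)


def ranking_loss(true_idx, scores):
    """Share of (true, false) label pairs where the false label scores higher."""
    true_idx = set(true_idx)
    false_idx = [i for i in range(len(scores)) if i not in true_idx]
    if not true_idx or not false_idx:
        return 0.0
    wrong = sum(1 for t in true_idx for f in false_idx if scores[f] > scores[t])
    return wrong / (len(true_idx) * len(false_idx))


# Four labels (indices 0-3); labels 0 and 2 are the true codes.
scores = [0.9, 0.8, 0.4, 0.1]
true_idx = [0, 2]
predicted_idx = [i for i, s in enumerate(scores) if s >= 0.5]  # labels 0 and 1

precision, recall, f1, jaccard = set_metrics(true_idx, predicted_idx)
coverage = coverage_error(true_idx, scores)  # label 2 is ranked third
loss = ranking_loss(true_idx, scores)        # one of four pairs is misordered
```

In this example label 2 is under-coded (missed) and label 1 is over-coded (added), which the set-based measures penalize equally while the ranking measures only look at the score order.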


In Table 5, we show that our model consistently performs better on all evaluation metrics. We see improvements of 4-5% on the inpatient dataset and 2-3% on the outpatient dataset. We also found that the different data sources do not contribute equally to accuracy; for instance, prescription data is often more informative. As previously suspected, each source was found to converge at a different rate, with a different number of iterations and model complexity. Deep networks can find an optimal minimum for some data modalities faster than for others. That is why learning each modality separately captures the varying levels of data complexity and leads to better accuracy.

In addition, our model obtained human-like performance in predicting the primary diagnosis, especially for inpatient data. This is a reassuring result for many critical applications of clinical coding, for instance medical billing, which relies essentially on the identification of the primary diagnosis.

Table 5: Automatic coding accuracy

In Table 6, we show the top 5 disease categories ranked by accuracy. For the top 3 categories of the inpatient dataset, accuracy surpasses 90%, which is very promising. For neoplasm-related cases, which make up ~30% of the data, our model obtains 80% accuracy, which is fairly encouraging. Even though the figures for outpatient cases are lower, accuracy is still over 60% on average (on ~65% of the data), which is a great first step towards further improvements.

Table 6: Accuracy of top 5 high-level diagnosis categories

Self-aware Model Performance

Traditionally, machine learning models are built and evaluated during the train/evaluate process. The most popular method is by checking the performance on a hold-out dataset. However, moving on to real-time prediction, we do not generally have a way to evaluate the current predictions – as there is no hold-out set. To combat this problem, we would like to have a metric for how “confident” the model is in its prediction. For example, it would be very useful for us to know that the model has a high degree of confidence over simple, well understood episodes of care, but poor confidence over a complex case. This would enable the model to trigger a process enabling a human to review the case.

We propose a confidence assessment model to be coupled with our ICD-10 prediction model. In Figure 6, we show a graphical illustration of our confidence assessment network. We train the model on the discrepancy between predictions and actual codes, taking into account all input evidence. This way, the model can assess the predictions in light of what the inputs look like, which gives it a sense of case complexity and of how likely the predictions are to be good or bad.
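
One way to picture the training data for such a confidence model is sketched below: each row pairs the input evidence and the coder model's output with a target that measures how well the predicted codes matched the actual ones. Using Jaccard overlap as the discrepancy measure, and the dictionary keys shown, are assumptions for illustration; the article does not specify either.

```python
def agreement(predicted, actual):
    """Jaccard overlap between predicted and actual code sets (1.0 = identical)."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 1.0


def confidence_training_rows(episodes):
    """Build (input, target) pairs for a confidence model.

    Each episode is a dict with hypothetical keys: 'features' (input evidence),
    'probabilities' (coder model output), 'predicted' and 'actual' (code sets).
    """
    rows = []
    for ep in episodes:
        inputs = list(ep["features"]) + list(ep["probabilities"])
        rows.append((inputs, agreement(ep["predicted"], ep["actual"])))
    return rows


toy_episodes = [
    {
        "features": [1.0, 0.0],
        "probabilities": [0.9, 0.2, 0.6],
        "predicted": {"I10", "K35"},
        "actual": {"I10"},
    },
]
rows = confidence_training_rows(toy_episodes)
```

A regressor trained on such rows would output a score between 0 and 1 for a new episode, without needing the actual codes at prediction time.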

Figure 6: Our confidence assessment model structure

In Table 7, we present how well our confidence assessment network performs when tested on different data scopes (%) of the dataset. This way we provide a confidence score coupled with each prediction. For instance, on 3% of the cases in the dataset we are 97% confident that our predictions are correct, and on 50% of the cases the model showed confidence levels of 85%. This is very convenient for automating the call for help when needed. The model is thus self-aware and can easily be run and evaluated in real time by users.
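
A table relating data scope to confidence can be derived from held-out predictions roughly as follows: rank the cases by confidence, take the most confident fraction, and report the confidence cut-off and the observed accuracy within it. This is a sketch on made-up values; the function name and the exact definition of "scope" are assumptions.

```python
def scope_table(cases, scopes=(0.25, 0.5, 1.0)):
    """For each data scope, report the confidence cut-off and observed accuracy.

    cases: list of (confidence, is_correct) pairs, one per prediction.
    A scope of 0.5 means the 50% of cases the model is most confident about.
    """
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    table = []
    for scope in scopes:
        n = max(1, round(scope * len(ranked)))
        subset = ranked[:n]
        cutoff = subset[-1][0]  # lowest confidence still inside the scope
        accuracy = sum(1 for _, ok in subset if ok) / n
        table.append((scope, cutoff, accuracy))
    return table


# Made-up (confidence, correct?) pairs for eight predictions.
cases = [(0.97, True), (0.91, True), (0.85, True), (0.80, False),
         (0.66, True), (0.52, False), (0.40, False), (0.35, True)]
table = scope_table(cases)
```

Comparing the cut-off with the observed accuracy in each row shows whether the confidence scores are trustworthy enough to decide which cases go to a human.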

Table 7: Confidence rates accuracy versus data scopes


  • Ensemble modelling, with an expert network deciding on the best prediction, outperformed the other modelling approaches
  • Prescription data was found to be the most informative, as it usually occurs late in the diagnosis cycle
  • Pathology & Radiology data sources add up to 4% to accuracy
  • Patient demographics (ruling out age & gender based diseases) along with seasonality encoding (identifying time related infectious diseases) add up to 1% to accuracy
  • We have a system with human comparable performance (inpatient)
  • Further work is still underway to leverage more knowledge to tackle (outpatient) accuracy
  • In 50% of the cases, we have an 80% confidence in our predictions (call human when needed)
  • No discharge notes were included in our modelling, which means coding prediction can happen at any point from admission to discharge, with new evidence added incrementally
  • Adding the notes is part of the future work and is expected to push accuracy higher
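
The seasonality encoding mentioned in the list above is not specified further in this article. One common option, shown here purely as an assumption, is a cyclical encoding that places the admission month on a circle so that December and January end up close together.

```python
import math


def encode_month(month):
    """Map a month (1-12) to a point on the unit circle."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


january = encode_month(1)
december = encode_month(12)
february = encode_month(2)

# Neighbouring months are equally far apart, including across the year boundary.
gap_dec_jan = math.dist(december, january)
gap_jan_feb = math.dist(january, february)
```

With a plain integer month, December (12) and January (1) would look maximally different to a network, which is the opposite of what a winter infectious disease pattern needs.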

Potential Downstream Applications

Our successful preliminary endeavours to automate the complex process of clinical coding have paved the way for many interesting applications to support the healthcare community. There are numerous applications of coding automation, such as real-time analytics, provisional cost predictions, and provisional logistics & staffing planning. At this point, we propose some directly applicable solutions that use coding predictions:

Decision Support Systems

One very direct and useful downstream application to the automation of the clinical coding is to build a decision support system on top of the predictive models. Below are some of the most significant capabilities:

  • Building software tools that support clinical coders for faster coding
  • Recommending codes for an episode of care along with a confidence rate in the predictions
  • The clinical coder can act as a QA reviewer of the machine-generated codes
  • The clinical coder’s input is also used to re-train the model, which accumulates knowledge and incrementally becomes smarter over time
  • Real-time tracking and prediction of costs & billing and epidemiology trends & evolution
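
As a sketch of the recommendation capability listed above, the function below ranks codes by predicted probability and attaches a confidence value with a flag asking for human review. The function name, the top-k size and the 0.8 review threshold are hypothetical choices, not part of the described system.

```python
def recommend_codes(probabilities, labels, confidence, top_k=3, review_below=0.8):
    """Return ranked code suggestions plus a flag asking for human review."""
    ranked = sorted(zip(labels, probabilities), key=lambda lp: lp[1], reverse=True)
    return {
        "suggestions": ranked[:top_k],
        "confidence": confidence,
        "needs_review": confidence < review_below,
    }


labels = ["I10", "E11", "K35", "J18"]  # toy label space
result = recommend_codes([0.2, 0.7, 0.9, 0.1], labels, confidence=0.65, top_k=2)
```

The coder's accepted or corrected codes for flagged cases could then be stored as new training examples, matching the re-training loop described in the list.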

Automatic Auditing

Clinical coding auditing is the critical process of ensuring that coding is performed correctly and aligns with the specified guidelines. The outcome of the audit helps in analysing and reporting the current performance of healthcare facilities and in developing strategies for dealing with under-performing ones. There is an increasing emphasis, both locally and internationally, on the need for high quality, accurate auditing strategies. However, the process is currently done manually, which makes it dependent on varying levels of human expertise and subject to high error rates. This can lead to poor decisions and strategies in both the short and long term. Automated coding can be especially valuable in this area, assisting with:

  • Performing routinely recurrent audits
  • Capturing accuracy and productivity
  • Identifying suspicious patterns and trends
  • Understanding coding behaviour and compliance of different coders
  • Recognizing deficient areas where training and development are needed


In this article, we wanted to draw much-needed attention to clinical coding practice in the health care domain and to how automation can be applied in this area. We presented a number of modelling architectures, and our novel ensemble deep learning model proved to fit the problem best. Interestingly, it could harness enough knowledge from the multiple clinical data sources to perform the task, which suggests that the approach could be made even more accurate by incorporating additional data sets. We consume, process and model data from different modalities including unstructured, semi-structured and structured tabular data. Due to the sensitivity of mispredictions in this area, we further propose an automatic confidence assessment model that produces a confidence rate for our predictions in real time.

We thoroughly evaluated our models quantitatively on the Maharaj Nakorn Chiang Mai Hospital dataset, showing the great potential of our models for adoption in real clinical coding practice. One extra benefit is that we trained our models without discharge summaries. This gives us the advantage of continuous & incremental prediction of ICD-10 codes as new clinical evidence arrives, up until discharge, which allows for real-time reporting of ongoing diagnoses in hospitals. The positive results have opened the door to further work that has already started. We can apply our models in real time and continuously train them on the fly with newly arriving cases.

What’s next?

Where we are now

We have only scratched the surface of automation in the clinical coding area, opening new horizons for productizing such a service for a wide range of health care providers. There are still a large number of non-technical issues that need to be addressed, including Data & Machine Learning Governance, regulatory challenges and changes to workflows.

We are most interested in incorporating this research into decision support systems, in order to prove the benefits and integrate the solution into current systems and processes. We would like to partner with Australian health care services and hospitals on this journey to help enhance and automate an important area in the healthcare domain.

Please contact Terence Siganakis if you are interested in partnering with us to evaluate the technology on your data sets.


Below are videos of our latest research work on ICD-10 coding automation, accepted and presented at the Australian Digital Health Institute summit (DHIS), November 2020


[1] C. Smith, S. Bowman and J. A. Dooling, “Measuring and Benchmarking Coding Productivity: A Decade of AHIMA Leadership”, (2019)

[2] K. Charland, “Measuring Coding Accuracy and Productivity in Today’s Value-Based Payment World”, Journal of AHIMA, (2017)

[3] Australian Institute of Health and Welfare, “The coding workforce shortfall”, (2011)


[5] S. A. Zafirah, A. Nur, S. Ezat, W. Puteh and S. Aljunid, “Potential loss of revenue due to errors in clinical coding during the implementation of the Malaysia diagnosis related group (MY-DRG) Casemix system in a teaching hospital in Malaysia”, BMC Health Services Research, (2018)

[6] H. A. Khwaja, H. Syed and D. W. Cranston, “Coding errors: a comparative analysis of hospital and prospectively collected departmental data”, BJU International, (2002)

[7] “The Second Australian Atlas of Healthcare Variation”

[8] P. Cheng, A. Gilchrist, K. Robinson and L. Paul, “The risk and consequences of clinical miscoding due to inadequate medical documentation: a case study of the impact on health services funding”, NCPI, (2009)

[9] L. Knight, R. Halech, C. Martin and L. Mortimer, “Impact of changes in diabetes coding on Queensland hospital principal diagnosis morbidity data”, Technical Report, Queensland Government, (2011)

[10] D. Svozil, V. Kvasnicka and J. Pospichal, “Introduction to multi-layer feed-forward neural networks”, Chemometrics and Intelligent Laboratory Systems 39(1), 43-62, (1997)