
Tuesday 18 June 2019

Deep Understanding Vs Deep Learning - Automatic coding of pathology reports to standards requirements.

Deep Understanding (DU) is our label for processing rooted in the Computational Linguistics tradition of using gold standard annotation and computational lexicogrammatical semantics to create token features before applying machine learning to assemble a Language Model (LM). Deep Learning (DL) is the extension of text mining's bag-of-words language model into neural net vectorisation for token features. While the mechanisms are distinctly different, DL methods have captured the attention of data scientists around the world, who now claim the mantle of performing Natural Language Processing (NLP).

In 2018 NAACCR introduced new requirements for the coding of pathology reports for cancer registries. Many new histology codes were introduced that require supplementary information, e.g. Biomarker and Molecular results, which is not held in a single report, so a new strategy is needed to analyse multiple documents to achieve the coding. Where Biomarker and Genetic results are in one report, some coding functions are performed when this information is separately demarcated, but when it is placed in the general text it is not always easy to identify the content correctly and unambiguously.


These complexities have required a significant campaign by NAACCR and NCRA to train and upskill thousands of registrars across the USA and Canada in the revisions.

Training so many people in a rich and complex task has energised new attention to developing automatic methods of coding pathology reports.


The great advantage of an automatic method of coding is that once it has mastered the coding for a particular tumor stream it doesn’t forget that processing method. The disadvantage is that it can be susceptible to errors when the variety of human language usage strays significantly from the language used to build its model for the coding task.


A study using DL methods to code reports has been presented by the US Department of Energy. It shows performance levels not uncommon in NLP research, varying from 60-90%, but not an accuracy that would assist production-line processing by registries. The equivalent figures produced by our company are better, all between 90-99%. These differences warranted an analysis of how the DU and DL methods differ. Our experiments on a common task of document classification, discriminating cancer reports from non-cancer reports on a common test corpus, demonstrated that the DU method performed about 15% more accurately than the DL method.



Jon Patrick held the Chair of Information Systems at the University of Sydney from 1998 to 2004 and then moved to the Chair of Language Technology. In 2005 he won Australia’s national Eureka Science Prize for his work in natural language processing. He has conducted extensive research on the use of language technology in Intensive Care, Pathology and Radiology departments, and in information systems research in emergency medicine and oncology. In 2012 he left the University of Sydney to pursue his interests in R&D in Health IT and NLP, and he is the CEO of Health Language Analytics (HLA), its subsidiary Health Language Analytics Global, and Innovative Clinical Information Management Systems (iCIMS). His NLP companies hold contracts with the California Cancer Registry, the Centers for Disease Control, the University of California, and other hospitals and authorities. His clinical systems company has built over 30 different applications, particularly in the arena of cancer care and tumour board solutions.


Misinformation & Miscommunication in social media

Date: 21 June 2019 at 14:00, at LIMSI

Social media have become the default channel for people to access information and express ideas and opinions. Unfortunately, they are also used to insult people: the relative anonymity of social media lets people hide and facilitates the propagation of toxic, hateful and exclusionary messages targeting categories of people on the basis of their gender (misogyny), race or religion (immigrants), etc. Moreover, social media foster information bubbles, and every user may end up receiving only the information that matches her personal biases, beliefs, tastes and points of view. A perverse effect is that social media are a breeding ground for the propagation of fake news: when a piece of news matches our beliefs or outrages us, we tend to share it without checking its veracity. Therefore, social media contribute, paradoxically, to the misinformation and polarization of society, as we have recently witnessed in the last presidential elections in the USA, the Brexit referendum, and the Catalan issue. In this talk some of these problems will be illustrated and some examples of shared tasks addressing them will be discussed.


Paolo Rosso is a full professor at the Universitat Politècnica de València, Spain, where he is also a member of the PRHLT research center. His research interests focus mainly on author profiling, irony detection, opinion spam detection, and plagiarism detection. Since 2009 he has been involved in the organisation of PAN benchmark activities at the CLEF and FIRE evaluation forums, mainly on plagiarism / text reuse detection and author profiling. At SemEval he has been co-organiser of shared tasks on sentiment analysis of figurative language in Twitter (2015), and on multilingual detection of hate speech against immigrants and women in Twitter (2019). He has been PI of national and international research projects funded by the EC and the U.S. Army Research Office. Recently, in collaboration with Carnegie Mellon University, he has been involved in a project funded by the Qatar National Research Fund on author profiling for cyber-security, which aims to profile who is behind threatening messages. He serves as deputy steering committee chair for the CLEF conference and as associate editor for the Information Processing & Management journal. He has been chair of *SEM-2015, and organization chair of CERI-2012, CLEF-2013 and EACL-2017. He is the author of 200+ papers, published in journals, book chapters, and conference and workshop proceedings.





Campus universitaire bât 507
Rue du Belvedère
F - 91405 Orsay cedex
Tel. +33 (0) 1 69 15 80 15

