Automated industry classification system for business data company Herold
As one of my most recent projects, I developed an automated industry classification system for an Austrian business data company. One of the main problems with the company's database was that the categories and the taxonomy of the industry groups were messy and unstructured. In addition, the categories were missing for a large share of the records. The goal of the project was therefore to classify business records into their respective industry categories automatically. As can be seen in the figure below, the high-level process of the final application consists of four main steps:
- In the first step, the system receives either a single company website URL or a list of website URLs.
- In the second step the system extracts the statistically most relevant keywords from the website.
- In the third step, a machine learning algorithm, more specifically a neural network, classifies the extracted keywords into the respective industry category.
- In the fourth and last step the classification output is saved in the database.
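The four steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the production code: page text is passed in already downloaded, "statistically most relevant" is approximated with a smoothed TF-IDF score, the neural network is replaced by a hypothetical keyword lookup, and a plain dict stands in for the database. All function names, the stop-word list, and the category labels are made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller, language-specific one.
STOP_WORDS = {"the", "and", "for", "our", "we", "a", "of", "in", "to", "is", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-zäöüß]+", text.lower()) if t not in STOP_WORDS]


def top_keywords(page_text, corpus_texts, k=5):
    """Step 2: rank the words of one page by TF-IDF against a reference corpus."""
    docs = [set(tokenize(t)) for t in corpus_texts]
    counts = Counter(tokenize(page_text))
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        doc_freq = sum(1 for d in docs if word in d)
        idf = math.log((1 + len(docs)) / (1 + doc_freq)) + 1  # smoothed IDF
        scores[word] = (count / total) * idf
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def classify_keywords(keywords):
    """Step 3 placeholder: a lookup table standing in for the trained network."""
    rules = {"pizza": "Gastronomy", "plumbing": "Construction"}  # hypothetical labels
    for word in keywords:
        if word in rules:
            return rules[word]
    return "Unknown"


def run_pipeline(pages, corpus_texts, store):
    """Steps 1 to 4: take {url: page_text}, extract keywords, classify, save."""
    for url, text in pages.items():
        keywords = top_keywords(text, corpus_texts)
        store[url] = classify_keywords(keywords)
    return store
```

Calling `run_pipeline` with a dict of one or many URLs mirrors the single-URL and list inputs of the first step. Swapping `classify_keywords` for a trained model would leave the rest of the flow unchanged.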
Some additional features of the classification system:
- The system was developed in Python 3.9 and deployed as a Microsoft Azure Function that can be called via a URL.
- As the machine learning model, I used a multi-layer recurrent neural network based on an LSTM vectorizer, implemented with Keras/TensorFlow.
- I designed a new industry taxonomy consisting of 35 top-level categories, 230 mid-level categories and 3,200 sub-sub-level categories.
- For the web crawler, which extracts the statistically most relevant keywords from the website, I developed an algorithm that tags words by their word type (part of speech).
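To illustrate the three-level taxonomy mentioned above, here is one possible way to represent it so that a predicted sub-sub-level category can be traced back to its mid-level and top-level parents. The codes, category names, and the `category_path` helper are hypothetical and chosen only for this sketch; the actual taxonomy entries are not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Category:
    code: str
    name: str
    parent: Optional[str] = None  # code of the parent category; None at the top level


# Three hypothetical entries, one per level of the taxonomy.
TAXONOMY: Dict[str, Category] = {
    c.code: c
    for c in [
        Category("T01", "Gastronomy"),
        Category("M011", "Restaurants", parent="T01"),
        Category("S01101", "Pizzerias", parent="M011"),
    ]
}


def category_path(code: str, taxonomy: Dict[str, Category] = TAXONOMY) -> List[str]:
    """Return the category names from the top level down to the given code."""
    path = []
    current = taxonomy.get(code)
    while current is not None:
        path.append(current.name)
        # Walk up one level; stop once a category has no parent.
        current = taxonomy.get(current.parent) if current.parent else None
    return list(reversed(path))
```

Storing only the parent code on each entry keeps the structure flat, so a classifier only needs to predict the most specific code and the higher levels follow from the lookup.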