
Developing a neural network based on Python and Keras for automated industry classification


In this post I will show how to develop and implement a neural network using Python and Keras, which is based on Google’s machine learning platform TensorFlow. The neural network was part of an earlier project, in which the system extracts the statistically most relevant keywords from a website and classifies them into the respective industry category. In this post the model will be trained on a dataset of 16,000 firms, with their top website keywords as input variables and more than 30 different industry categories as the output variable.

While machine learning algorithms are based on heavy mathematical theory, the process of developing, training, testing and applying a model is, in fact, an experimental process. With each round we change the parameters and test which model works best. The development and implementation of most machine learning algorithms basically consist of five steps: (1) feature extraction, (2) dimensionality reduction, (3) training, (4) performance testing and (5) the practical application of the model on real-world data.

Neural networks are very powerful algorithms, most importantly because of the elegance of their mathematical logic. They are an abstraction of how parts of biological brains function. In biological brains, neurons receive signals through their dendrites; these signals are weighted based on their frequency and importance and accumulated in the so-called nucleus. As soon as the accumulated signal reaches a certain threshold, the neuron fires and passes the signal on to other neurons.

The example in the figure below shows an artificial neural network with one input layer, three hidden layers and one output layer. Each of the layers consists of neurons that are connected with the neurons of the next layer.

Neural networks can be mathematically expressed by the formula below. Beyond the formula, three main components characterize almost all neural networks: (1) the topology, (2) the training algorithm and (3) the activation function.
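For a single neuron, the standard form of this formula is

$$ y = \varphi\left( \sum_{i=1}^{n} w_i x_i + b \right) $$

where the x_i are the input signals, the w_i the learned weights, b a bias term and φ the activation function.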

The topology defines the general structure of the network, for instance how many layers the network contains or how many neurons each layer includes. The training algorithm is used to calculate the weights, and the activation function fires the signal to the next connected neuron once the input signal reaches a certain threshold. For this neural network I used a softmax function because it is better suited for multi-class predictions than the sigmoid function.
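As a minimal illustration of this difference (plain NumPy, independent of the model below): softmax normalizes the raw class scores into a probability distribution that sums to one, while the sigmoid squashes each score independently.

import numpy as np

def softmax(z):
    # Shift by the maximum for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.1])  # raw output scores for three classes
print(softmax(scores))  # ~[0.66 0.24 0.10], sums to 1 -> class probabilities
print(sigmoid(scores))  # ~[0.88 0.73 0.52], independent values, no distribution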

Before we start the training process we perform feature extraction and dimensionality reduction: a tokenizer transforms each text into an integer vector of fixed length, which the network then processes word by word. The LSTM layer stores relevant information from earlier words in its internal memory while it loops over the sequence, which allows the algorithm to place a word into the context of the words that preceded it. For the following neural network, the topology will have four layers: (1) an embedding layer with a maximum number of 50,000 words and a dimension of 100, (2) a spatial dropout layer with a dropout rate of 0.2, (3) an LSTM layer with 100 memory units and (4) a dense layer whose length equals the total number of classes and which includes the activation function.

Now let’s implement it using Python and Keras. In the code below we first import the libraries that we need for training and testing the model. We use Keras, which is based on Google’s TensorFlow, to build and train the model.

import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Keras (built on TensorFlow) for building and training the model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.callbacks import EarlyStopping

In the next step, we will load the dataset into the program and create the dataframe object.

df = pd.read_csv('Training_Data_Raw_Clean_True_Sample_500_FINAL.csv', sep=',', 
encoding='utf-8')

To make sure that we have a balanced dataset, we plot the distribution of the classes. If the classes are imbalanced, we may have to collect more data or weight the classes by their distribution.

fig, ax = plt.subplots()
fig.suptitle('Class', fontsize=12)
df['Class'].reset_index().groupby('Class').count().sort_values(by='index').plot(kind='barh', 
legend=False, ax=ax).grid(axis='x')
plt.show()

In the next step we define several parameters before moving on with some data preprocessing steps. We define the maximum number of words in the vocabulary (keeping only the most frequent keywords), the maximum length of a document and the size of the embedding layer.

MAX_NB_WORDS = 50000
MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 100

Next, we create a tokenizer, fit it on the texts and create a word index. After the training of the model is finished we will save the tokenizer as a pickle file, because we will need it later to predict new data.

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', 
lower=True)
tokenizer.fit_on_texts(df['Text_Top100'].values)
word_index = tokenizer.word_index
print('Dataset includes %s unique tokens.' % len(word_index))

We now use the tokenizer to transform each document into an integer sequence and pad all sequences to the same length for the model.

X = tokenizer.texts_to_sequences(df['Text_Top100'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

In the next step we transform the classes (Y) into dummy variables. We do this because the classes are nominal rather than cardinal or ordinal variables, so we need to one-hot encode them into a representation the model can work with.

Y = pd.get_dummies(df['Class']).values
print('Shape of label tensor:', Y.shape)
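As a quick illustration of what pd.get_dummies produces (a toy example with made-up class names):

demo = pd.Series(['Banking', 'Marketing', 'Banking'])
# One indicator column per class, sorted alphabetically; each row has a
# single 1 in the column of its class
print(pd.get_dummies(demo))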

Now we split the dataset into a training and a testing dataset; the former will contain 90% of all firms and the latter the remaining 10%. The parameter random_state is a seed value, which allows us to reproduce the same random data split.

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10, 
random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

Now we are at the point where we can develop the model. In the code snippet below we initiate the model and define its layers and their parameters.

model = Sequential()
# Embedding layer: maps each of the 50,000 token IDs to a 100-dimensional vector
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# Drops entire embedding dimensions at random to reduce overfitting
model.add(SpatialDropout1D(0.2))
# LSTM layer with 100 memory units
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# Output layer: one neuron per class, softmax yields class probabilities
model.add(Dense(30, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
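To verify that the topology matches the four layers described above, we can print an overview of the layers and their parameter counts:

model.summary()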

After the model is defined, we set a few parameters for the training process and train the model. The batch size parameter helps us avoid memory overload during the training process.

epochs = 10
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, 
validation_split=0.1, callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

"""
Epoch 1/10
190/190 [==============================] - 48s 233ms/step - loss: 3.3299
accuracy: 0.0568 - val_loss: 2.7465 - val_accuracy: 0.1616

Epoch 2/10
190/190 [==============================] - 46s 243ms/step - loss: 2.6218
accuracy: 0.2174 - val_loss: 2.4607 - val_accuracy: 0.2276
Epoch 3/10

190/190 [==============================] - 55s 287ms/step - loss: 2.1239
accuracy: 0.3501 - val_loss: 1.9390 - val_accuracy: 0.3988
Epoch 4/10

190/190 [==============================] - 47s 249ms/step - loss: 1.5604
accuracy: 0.5294 - val_loss: 1.5267 - val_accuracy: 0.5448
Epoch 5/10

190/190 [==============================] - 46s 242ms/step - loss: 1.0613
accuracy: 0.6898 - val_loss: 1.4558 - val_accuracy: 0.5589
Epoch 6/10

190/190 [==============================] - 47s 249ms/step - loss: 0.7276 
accuracy: 0.7973 - val_loss: 1.2062 - val_accuracy: 0.6679
Epoch 7/10

190/190 [==============================] - 46s 240ms/step - loss: 0.4754
accuracy: 0.8782 - val_loss: 1.0980 - val_accuracy: 0.7116
Epoch 8/10

190/190 [==============================] - 45s 236ms/step - loss: 0.3312
accuracy: 0.9184 - val_loss: 1.3245 - val_accuracy: 0.6575
Epoch 9/10

190/190 [==============================] - 46s 242ms/step - loss: 0.3163
accuracy: 0.9246 - val_loss: 1.1268 - val_accuracy: 0.7079
Epoch 10/10

190/190 [==============================] - 45s 237ms/step - loss: 0.2286
accuracy: 0.9479 - val_loss: 1.0447 - val_accuracy: 0.7487

After the training process is finished, we evaluate the model on the testing data and plot the loss and accuracy curves.

accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show()

plt.title('Accuracy')
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()
plt.show()

Now we can apply the model to predict new documents. We need the tokenizer to transform the new input data. Note that the list of labels below must follow the same order as the dummy variable columns, which pd.get_dummies sorts alphabetically.

New_Website = ['PORR Management World Tiefbau Video Hochbau Infrastruktur 
Medien Publikationen Nachhaltigkeit Public Hansen Lehre Erdbau Altlasten 
Umwelttechnik Design Engineering Abdichtung Jobbörse Investor Relations 
Compliance Einkauf Karrierewege Weiterbildung Images flags czech Česká 
Republika Čeština Organisation Magazin Digitales Fachmagazin Länderseiten 
Français Polska Polski România Română Slovensko Sloven Revitalisierung Palais 
Hotel Bauüberwachung Großprojekte Einkaufszentren Industrie Sonderbauten Bauten 
Stadien Wohnbau Bahnbau Brückenbau Ingenieurbau Kraftwerksbau Leitungsbau 
Spezialtiefbau Straßenbau Tunnelbau Überregionaler Wasserbau Rückbau Deponie 
Kies Transport Umweltlabor Architektur Bauphysik Bauvorbereitung Brandschutz 
Building Modeling Generalplanung LEAN Techn Tragwerksplanung Beschichtung 
Betondeckenbau Facility Fassadenbau Feste Fahrbahn Flughafenbau Health Care 
Hochgebirgsbau Hochhäuser Partnership Property Stahlbau Broschüren 
Ansprechpersonen Öffentliche Abbruch Traineeprogramm']

sequence = tokenizer.texts_to_sequences(New_Website)
padded_sequence = pad_sequences(sequence, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded_sequence)

labels = ['Class_10_Holzverarbeitung', 
'Class_11_Glas_Keramik_Porzellan', 
'Class_12_Optik_Schmuck_Uhren_Edelmetalle', 
'Class_13_Zellstoffe_Papier_Verpackungen', 
'Class_14_Textilien_Leder', 
'Class_15_Lebensmittelerzeugung_Genussmittel', 
'Class_16_Landwirtschaft_Forstwirtschaft_Gartenbau', 
'Class_17_Logistik_Transport', 
'Class_1_Abfallwirtschaft_Energie_Wasserwirtschaft', 
'Class_20_Medizin_Healthcare', 
'Class_21_Pflege_Betreuung_Soziale_Dienste', 
'Class_22_Banking', 
'Class_23_Unternehmensberatung_BWL_Dienste', 
'Class_24_Immobilien', 
'Class_25_Rechtswesen_Anwaelte', 
'Class_26_Marketing', 
'Class_27_Medien_Druck_Verlagswesen', 
'Class_28_Kultur_Kunst_Brauchtum', 
'Class_29_Sport_Freizeit_Events', 
'Class_2_EDV_Telekommunikation_Elektronik', 
'Class_30_Gastwirtschaft_Restaurant_Touristik', 
'Class_31_Friseure_Beauty_Wellness', 
'Class_32_Schulen_Fortbildungseinrichtungen', 
'Class_33_Behoerden_Parteien_Verbaende', 
'Class_3_Maschinen_Anlagen_Werkzeuge', 
'Class_4_Automobil_Fahrzeuge_Motoren', 
'Class_5_Metallverarbeitung', 
'Class_6_Pharmazie_Biotechnologie_Bio_Engineering', 
'Class_8_Baubranche_Handwerk', 
'Class_9_Rohstoffe_Baustoffe']

print(labels[np.argmax(pred)])

Class_8_Baubranche_Handwerk
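Since pd.get_dummies sorts the class names alphabetically, the hard-coded list above can also be derived directly from the training data, which rules out any mismatch between the label order and the model output:

# The column order of the dummy variables defines the label order
labels = list(pd.get_dummies(df['Class']).columns)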

Finally, we save the model and the tokenizer in our working directory, so that we can use them later and apply them to real-world data.

model.save('C:/Users/Dokumente/Developing/Python/Branch Classifyer')
filename = 'Tokenizer_ANN.sav'
pickle.dump(tokenizer, open(filename, 'wb'))

If we wanted to use the model later, we could simply load the saved model and the tokenizer into the program by using the following commands.

from keras.models import load_model

model = load_model('C:/Users/Dokumente/Developing/Python/Branch Classifyer')
tokenizer = pickle.load(open('Tokenizer_ANN.sav', 'rb'))

That’s it.
