Scikit & something to do while it runs

I decided to play around with some of the various settings in the neural network to see if I could get better accuracy than with linear regression. Well, I stumbled across this MLPClassifier class in scikit… once again, easy peasy, just what you want. I am starting to really like these guys.

A good place to start seemed like the size of the hidden layer, since I get the sense that’s a pretty critical parameter. The conventional wisdom seems to be that it should be somewhere between the size of the inputs and the size of the outputs. So for this project, somewhere between 1 and 1024. Right now it’s at 256. Sometimes people say to make it smaller, sometimes people say to make it bigger. Who knows? So I tried making it smaller (128), and I got a lot of this business:

C:\Users\John\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

That doesn’t sound like good news. So I tried making it 512… still the same warnings. So I tried 1024 and the warnings went away!
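Reading that warning more carefully, though, it’s complaining about the optimizer hitting its iteration cap (max_iter, which defaults to 200), not about the layer size. So another way to quiet it down – if I’m reading it right – would be to give the smaller network more iterations:

from sklearn.neural_network import MLPClassifier

# a sketch: keep the 128-unit hidden layer, but let the stochastic
# optimizer run longer than the default 200 iterations
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=1000)
clf.fit(elmo_train_new, train['label'])

I haven’t actually tried that yet, though – I just went with the bigger layer.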

The results really weren’t what I was looking for, though. I would describe them as “overconfident”. As in, the probabilities were always either 99% or 1% for each mood. It started to look like the linear regression did before we discovered that predict_proba() function. Not very interesting! I want something more nuanced.

That’s when I discovered yet another nifty feature of scikit, thanks to this dude – the GridSearchCV class. This thing will chop up my dataset in various ways and hold a contest between all the different combinations of parameters, all with one function call:

import pickle

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# read data
train = pd.read_csv('elmo/train_' + mood + '.csv')

# load elmo_train_new
pickle_in = open('elmo/elmo_train_' + mood + '.pickle', 'rb')
elmo_train_new = pickle.load(pickle_in)

# every combination of these parameters enters the contest
param_grid = [
    {
        'activation': ['identity', 'logistic', 'tanh', 'relu'],
        'solver': ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [
            (128,), (256,), (512,), (1024,)
        ]
    }]

# create grid search
grid_search = GridSearchCV(MLPClassifier(), param_grid, scoring='accuracy')
grid_search.fit(elmo_train_new, train['label'])

# print best parameters
print('best parameters for ' + mood + ':')
print(str(grid_search.best_params_))

I don’t even know what any of this stuff means yet, and scikit is just going to find the best ones for me!
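And when it finishes, the winning model itself is right there too – here’s a quick sketch using GridSearchCV’s standard attributes (by default it re-fits the best combination on the whole training set):

# the best cross-validation accuracy, and the re-fit winning model
print('best accuracy for ' + mood + ': ' + str(grid_search.best_score_))
best_model = grid_search.best_estimator_

# hypothetical usage: score new ELMo vectors with the winner
# probs = best_model.predict_proba(elmo_vectors)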

Unfortunately, this process is going to take something on the order of 18 hours, as far as I can tell… it’s already been running for about five and a half hours, and it’s only finished four of the moods. Here’s what the choices look like so far:

best parameters for grateful:
{'activation': 'identity', 'hidden_layer_sizes': (128,), 'solver': 'sgd'}
best parameters for happy:
{'activation': 'relu', 'hidden_layer_sizes': (128,), 'solver': 'adam'}
best parameters for hopeful:
{'activation': 'relu', 'hidden_layer_sizes': (1024,), 'solver': 'adam'}
best parameters for determined:
{'activation': 'relu', 'hidden_layer_sizes': (1024,), 'solver': 'adam'}

Interesting, but obvious in hindsight – we can configure the networks differently for each mood! This makes sense, since the dataset is so small and some moods are present much more frequently than others. For some of the moods that don’t appear very often, there may really be no optimal solution.

Anyway, since I have something like 12 hours to kill, I need to find something else to do! And I think I have settled on a new project. I have this desktop Java program that I wrote for myself that I call “Ledger” – it is basically an electronic checkbook. I actually started this blog thinking that rewriting it for the web would be my first project, and I was going to use Java/Spring on the back end and Bootstrap & JQuery on the front end.

Well, that became a rather frustrating endeavor, because the grids would jump around all over the place whenever I tried to do anything – there was always a round trip to the server, and everything had to redraw itself afterwards. This really needs to be a single-page application! And now that I am more familiar with React and Django, I think we have a better stack all around. We can take advantage of Django’s automatic database persistence (the ORM) and maybe not have to maintain all the classes that look just like database tables.

So, I’ll be jumping back and forth between these two projects as time permits. Hopefully that will not make the blog too confusing.

Linear Regression

I have taken a step back and learned a little bit about what Linear Regression actually is. As far as I can tell, it’s what I have described here with the red and blue dots. It’s not a “neural network” at all in the sense that it has no hidden layers. That’s how I think of a neural network, anyway – it doesn’t really get interesting until you have hidden layers. That might not be the official definition, though.

I am also refreshing my memory on what the hidden layers actually do. It seems they combine the functions from previous layers in order to approximate functions that are non-linear. So my red-and-blue-dots picture for networks with a hidden layer would look more like this:

And as you add more layers you combine these more complicated functions, and of course it’s all in hundreds of dimensions so it can get quite complicated.

It also makes sense that the more layers you have, the more danger there is of “overfitting”. It is hard to overfit with a straight line, but with an arbitrarily complicated series of segments you could carve out a little spot for each data point and ignore everything else.

Taking this back to my lousy keras model, we can approximate scikit’s LinearRegression() model with something more like this:

from keras.layers import Input, Dense
from keras.models import Model

# build model: no hidden layer, a single linear output, and mse loss --
# in other words, plain linear regression
elmo_input = Input(shape=(1024,), dtype="float")
pred = Dense(1, activation='linear')(elmo_input)
model = Model(inputs=[elmo_input], outputs=pred)
model.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])

I got ideas for the various parameters from this fella. Anyway this model behaves much more like the scikit one; it’s not perfect, but it gives different answers for different inputs and the answers seem pretty good from an intuitive standpoint.

So what can we learn from this? It seems that the magic stuff going on in ELMo makes their output vectors pretty well suited to linear regression. That’s pretty handy, since linear regression is easy! Doing something fancier is going to take a little more work in adjusting parameters, because there are more things to adjust.

Scikit FTW!

I haven’t learned much about why the ELMo+LSTM combination isn’t working so well, but I did learn something nifty about scikit! If you change this little bit of code:

# predict() returns hard 0-or-1 labels
classification = lreg.predict(self.elmo_text)
return int(classification[0]) * 100

to look like this:

classification = lreg.predict_proba(self.elmo_text)
# probabilities are [[probability_false, probability_true]]
return classification[0][1] * 100

You get a nice-looking histogram for ELMo! And just judging from typing in a few random sentences, it is way more accurate than the LSTM. This approach is the clear winner so far.

While trying to fix ELMo+LSTM, I discovered keras, which seems kind of nifty… you can create a model with just a few lines of code, like this:

import keras
from keras.layers import Input, Dense
from keras.models import Model

# build model: one hidden layer, then a single-unit output
elmo_input = Input(shape=(1024,), dtype="float")
dense = Dense(256, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001))(elmo_input)
pred = Dense(1, activation='softmax')(dense)
model = Model(inputs=[elmo_input], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

That looks nice, doesn’t it? Each layer in the model is a line of code, and each line has just a few parameters. It feels like you could actually learn what they mean. Unfortunately, feeding the ELMo vectors into this gives really lousy results. Like, so bad I’m not even going to post it. It just gives all 0’s for any input. Not sure why.
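One thing does look suspicious to me in hindsight (my guess, not something from a tutorial): softmax over a single output always produces exactly 1, so that last layer can’t really express two classes at all. The conventional binary setup in keras would be a sigmoid output with binary crossentropy, something like this:

import keras
from keras.layers import Input, Dense
from keras.models import Model

# same model, but with the usual binary-classification ending:
# a sigmoid output unit and binary crossentropy loss
elmo_input = Input(shape=(1024,), dtype="float")
dense = Dense(256, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001))(elmo_input)
pred = Dense(1, activation='sigmoid')(dense)
model = Model(inputs=[elmo_input], outputs=pred)
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

I haven’t gone back to verify that this fixes it, though.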

The scikit model, on the other hand, goes like this:

from sklearn.linear_model import LogisticRegression

# create and train classification model
lreg = LogisticRegression()
lreg.fit(xtrain, ytrain)

Bam! That’s it. It does really well with the ELMo vectors, but it’s black box magic. I have no idea what’s going on inside LogisticRegression(), and if you look at the documentation… holy hell what a bunch of gobbledygook. I feel like it will be a while before I understand what’s going on in there. But perhaps if I can get one of the other packages to approximate this LogisticRegression thing, I can figure it out.

ELMo + LSTM

I decided to try feeding the ELMo vectors from my second example into the LSTM classifier from the first, and the results are rather underwhelming. It gives the same answers for any input, for every mood except possibly “determined” or “aware” (one of them changes a little bit each time, not sure which).

So, something is not quite right in there. One thing I did was to take out the “embedding”, since it needed a “vocabulary size” and there is really no vocabulary size in an array of floating point numbers. It’s also possible that LSTM is just not the right kind of network for this problem, or that any one of the parameters is not quite right… I will have to do some research.
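To make “taking out the embedding” concrete, here’s a minimal sketch of what I mean (the names and sizes here are mine, not the actual code): an LSTM whose first layer consumes the float vectors directly, instead of starting with an embedding lookup keyed by vocabulary size:

import torch
import torch.nn as nn

class VectorLSTM(nn.Module):
    # hypothetical sketch: no nn.Embedding(vocab_size, ...) up front,
    # because the inputs are already 1024-dimensional float vectors
    def __init__(self, input_size=1024, hidden_size=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, seq_len, input_size) -- one ELMo vector per word
        _, (hidden, _) = self.lstm(x)
        return torch.sigmoid(self.fc(hidden[-1]))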

Nightmares

OK, the elmo code was so horrifying that I really wasn’t going to be able to sleep tonight unless I fixed it. I added another step to the data preparation that saves the final scikit model to a file, and then I updated the ElmoClassifier to load the model instead of training it every time.
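The saving part is just the same pickle trick I already use for the training vectors – something like this (the file name is illustrative, not the exact one in the repo):

import pickle

# data prep side: train once, save the fitted scikit model
with open('elmo/model_' + mood + '.pickle', 'wb') as f:
    pickle.dump(lreg, f)

# ElmoClassifier side: load the model instead of retraining it
with open('elmo/model_' + mood + '.pickle', 'rb') as f:
    lreg = pickle.load(f)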

Also, to my horror, I realized I was generating the ELMo vectors for the input sentence thirteen times instead of just doing it once. That is mildly embarrassing. But the important thing is, I fixed it! ELMo now clocks in at a zippy 5 seconds or so.

GitHub stuff

I finally have all the code up on GitHub, or at least the relevant parts anyway. It is under the moodilyzer directory.

I made a config file for the react stuff to put the URLs in, because that seems like the only bit of sensitive information that the react side contains.

For the django side, I only included the lstm and elmo content and the basic directory structure. The rest of it is boilerplate code that can be found in any tutorial, and all it does is reveal configuration information that I probably shouldn’t reveal. Not that anyone is really hacking into my website, but who knows?

I also included the dataprep scripts for both methods, which show how to generate the files that are loaded when classifying a sentence.

Next steps:

  • Make the ELMo code way faster by saving/loading the last stage classifiers instead of training them every time
  • Create a third scheme that feeds the ELMo vectors into the LSTM classifier from the first approach
  • Generate some metrics so we can compare them!

Stay tuned!

ELMo

I have discovered a fascinating new (to me) approach to this language problem, adorably named “ELMo” by its authors. There’s another NLP program called “BERT”, so I guess they are on a Sesame Street tear.

Anyway, what I really like about ELMo is that you can encode any sentence into a vector of the same length, which makes it perfect for input into a neural network. ELMo actually generates a vector for every single word, but they are engineered in such a way (I think) that you can average the vectors of a sentence together to get a sort of “sentiment vector” for the whole sentence. That last bit could be total garbage, but the guy in the tutorial I found did it, so I’m running with it for now.
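In code, the averaging part is about as simple as it sounds – a sketch (the variable names are mine):

import numpy as np

# elmo_vectors: one 1024-dimensional ELMo vector per word in the
# sentence, shape (num_words, 1024)
sentence_vector = np.mean(elmo_vectors, axis=0)  # shape (1024,)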

There’s some serious magic going on in these vectors that ELMo generates, and I won’t pretend to understand it quite yet. But they are big – the ones I used are 1024 in size, so that’s quite a bit of room to put features. It is easy to imagine that a neural network could pick up on the features you’re interested in, and ignore the rest. And we can let go of all our worries about sentence length and parts of speech and so on, because magic!

My code still needs a little work. The response time is very slow compared to the LSTM, on the order of a minute. That is not ELMo’s fault – I am doing a lot of work on every pass that I should just be loading from a file.

Also, the classifier used in the tutorial seems to spit out 1’s and 0’s instead of a floating point number, which is kind of boring. The solution may be to feed the ELMo vectors into the LSTM classifier that I used in the first pass. The guy in the tutorial did say that this was a basic implementation; obviously he didn’t want to ruin his contest by making it completely awesome. So I think there is room for improvement!

Django + React = Cranium Explosion

I don’t mean that in the good way. I had a lot of noob problems with this setup. First of all, where is this “don’t repeat yourself” mantra of Django, I ask myself as I am opening up the settings file at moodilyzer/moodilyzer/moodilyzer/settings.py, and editing my templates at moodilyzer/moodilyzer/nn/templates/nn/index.html. Someone did not think this motto through.

The next problem is CORS. Mind you, I’m not even doing CORS on the deployment server! It’s all coming from the same place. But on my development machine, React wants to be on port 3000, and Django wants to be on port 8000. So they think I’m committing some sort of crime when I want them to play together. I have tried all the various fixes suggested on the internet and none has worked so far, so I am leaving this complaint here as a reminder to myself to post the answer when I find it. I’m sure it’ll be a doozy. My other option is to run npm build and stuff my react code into Django every time I make a change in development, but that really loses all of the slick development machinery that the React folks went to all the trouble to come up with.

What next… jquery ajax POST requests aren’t working. I suspect it is how the post data is formatted, because the request goes through as if there were no data. For now I am using GET requests, since it is all behind the scenes anyway, and I will probably try axios instead, since I should be moving on to something more hip than jquery. I mean, how gauche.

The real winner here is React-Bootstrap, which allowed me to build a somewhat slick-looking UI. Perhaps ‘slick’ is the wrong word, but let’s say much better than the raw HTML UI I started with! Check out Moodilyzer.js to see how it works. One interesting bit is how to deal with the CSRF token that Django requires. That is accomplished with the getCookie() function and the X-CSRFToken header in the ajax call. I got the getCookie() function from this fellow.

Moodilyzer!

I have created the first version of the “Moodilyzer”, a neural network that analyzes your mood! It is hilariously bad. However, it does something, and that is a starting point for improvement.

I used 13 single-output neural networks as discussed in earlier posts, and I put them behind a django server using mod_wsgi. There are great tutorials on the django/apache piece at djangoproject.com, Digital Ocean, python.org, etc. I really didn’t have to do much outside of the lines there, except uninstall python 2.x and add this little nugget to wsgi.py. Let’s see if I can highlight it sensibly:

import os
import sys

from django.core.wsgi import get_wsgi_application

path = u"/path/to/django/moodilyzer/moodilyzer"

if path not in sys.path:
    sys.path.append(path)

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'moodilyzer.settings')

Hmm, it seems I need a plugin to highlight things. I’m not big on plugins at the moment; I have been doing just fine with the basic functionality. But anyway, here’s the part I added:

if path not in sys.path:
    sys.path.append(path)

I am sure there is another way to do that. But this one also works.

I also made the rookie mistake of posting stuff to github with my local directory information in it! I quickly fixed it, but it is no doubt still visible in the history, so I just moved my django stuff. I had it in apache’s html directory, which I’m not supposed to do anyway, so I was going to have to move it eventually.

But anyway, on to the interesting part – the neural network! In order to get this to work I had to reduce the batch size to 1, because otherwise I guess it’s expecting me to submit 55 sentences every time? That makes the training rather slow, so I separated the scripts into training and inference, with the inference embedded in the django application. In between, the model is saved to a file with pickle. I had to separate the SentimentalLSTM class into its own file so both sides could use it. This works really well! PyTorch can load all of the models pretty much instantaneously. It only takes on the order of a second to get results.
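The save/load dance looks roughly like this (a sketch – the file names are made up, and pickling the model only works because the SentimentalLSTM class lives in a file both sides can import):

import pickle

# training script: save the trained model once
with open('models/lstm_' + mood + '.pickle', 'wb') as f:
    pickle.dump(net, f)

# django side: load it at startup instead of retraining
with open('models/lstm_' + mood + '.pickle', 'rb') as f:
    net = pickle.load(f)
net.eval()  # switch to inference mode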

The results of course are all over the place, and I think there are many reasons for that. A neural network, if I recall correctly from my studies, is nothing more than a hyperplane separator. This means that it is really good at separating groups of things like so:

But it is really bad at separating things that look like this:

And I think that second situation is what we have created here – particularly in how the dictionary of words was built, which was strictly by popularity:

# create word-to-int mapping
# sorted_words is a list of (word, count) pairs, most frequent first,
# so each word's number is just its popularity rank (starting at 1)
vocab_to_int = {w: i + 1 for i, (w, c) in enumerate(sorted_words)}

Let’s say we are trying to classify “good” moods vs. “bad” moods. Well the word “good” was assigned a value of 38, and “bad” 382. But then “awesome” is at 740! There’s no number you can pick that divides good words from bad words. And that’s essentially what we’re trying to do.

So in order to fix that, we’d have to find some other way to assign numbers to words – one that somewhat reflected where they were on the spectrum of emotion we’re trying to capture. For example, in the “happy” neural network, we might want to assign 100 to “happy”, and -100 to “sad”. Then we train the output such that positive numbers are happy, negative numbers are sad, and something around 0 is “can’t tell”.
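Hypothetically, something like this (the words and numbers are purely for illustration):

# a hand-tuned mapping for the "happy" network: the number encodes
# where a word sits on the happy/sad spectrum
sentiment_to_int = {
    'happy': 100,
    'glad': 75,
    'fine': 0,
    'gloomy': -75,
    'sad': -100,
}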

But what about the word “not”? What if the text we’re trying to analyze is “I am not happy”? In that case we’d want to negate the input of the word “happy”, in essence equate “not happy” with “sad”. But in order to do that kind of thing, we have to start parsing sentences, paying attention to parts of speech and so on. Which is something I think we’d want to do anyway, because consider the following sentences:

I am not happy.
When it rains I am not happy, and I don’t know why.

It would be reasonable to expect a computer to recognize both of these sentences as “sad”, despite the conditional nature of the second one. But these sentences are processed completely differently, because they are of different lengths. The “I am not happy” in the first sentence will be placed into different inputs than the “I am not happy” of the second sentence, and we are expecting the neural network to just “figure it out” even though we are not giving it all the information we have!

What if we had specific inputs for different functional parts of the sentence? For example if there were a “primary subject” input, then the “I” from both sentences could go in the same place. Of course we need some kind of parser that can do this, and I will explore that in future posts.

The bottom line is that these neural networks are not magic. I am reminded of an AI class I audited while I worked at Northwestern. The professor started off the first class with a game. We were to tell him a number, and then he would do a formula in his head and say another number as the result, and we had to guess the formula. The numbers went something like this:

9 → 4
4 → 4
13 → 8
10 → 3
3 → 5

And so on. Of course the class is full of nerds so we’re all trying to write down some mathematical formula that does this to all these numbers, and nobody can come up with one. Finally he reveals the trick – the second number is the number of letters when you write out the first number.
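In code, the whole mysterious “formula” is just this (a toy sketch, with only enough number names for the game):

# the professor's trick: a number maps to the letter count of its English name
NAMES = {3: 'three', 4: 'four', 9: 'nine', 10: 'ten', 13: 'thirteen'}

def trick(n):
    return len(NAMES[n])

print(trick(9))   # -> 4
print(trick(13))  # -> 8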

Without knowing that piece of information, you are really going to struggle trying to transform those numbers. But once you know the trick, it’s simple. His point was that when it comes to AI, if you know what you’re looking for, things are going to come out much simpler than if you don’t. And I think it is the same way with what we’ve created here. We are hiding a lot of things from the neural network and expecting it to magically figure them out, just like the game in the AI class.

Whether or not we can actually give the neural network the information it needs is yet to be seen, and may be beyond what I really want to accomplish with this blog. But we can sure give it a try!

New Home

Until now my “home in the cloud” has been DigitalOcean, and I’ve been pretty happy with them. Five bucks a month for a VPS, what’s not to like? But then I upgraded it to ubuntu 20.04, and the 1G of memory on my little cheapo machine could not keep up.

I looked around at other providers and it seemed that most of them were charging about $5/G of memory. But then I found this place VPSDime, which was charging $7 for a VPS with 6G of memory! I was immediately suspicious, of course. Is this a fake website that is going to rip me off? Is this part of some international spy ring?

I signed up anyway, though, because that’s how I roll. And so far this thing is blazingly fast, and I haven’t been ripped off or taken to international spy prison. So I am calling this a success!