How We Used ML While Building (Yet Another) Remote Job Board
We've explored all the remote tech job aggregators. None of them reliably verifies that a job is truly remote or accounts for strict geo-restrictions. Read about how we taught an algorithm to detect them.
It was a nice day at the end of 2020 when we suddenly decided to create another aggregator for remote vacancies, exclusively for IT positions. It is logical to ask why anyone should make another one when there are already enough of them on the market. The answer is straightforward: we saw how to improve on current solutions in at least five ways:
- Quantity: aggregate more vacancies than anyone else in the world;
- "Really" remote vacancies: not only "remote until COVID-19";
- Relevance: often, on similar sites, you can find a large number of irrelevant vacancies;
- Power of the search engine (in my opinion, the search on current remote job sites is at the level of 2005);
- Filter by citizenship.
The last of these is what I want to talk about today.
Anyone who has ever searched for a remote job knows that companies often offer remote work, but only to citizens of certain countries.
Most of the time, job description pages have no separate field for such restrictions, and there is no way to search or filter by them. The applicant therefore has to read the text of each vacancy carefully to work out whether it makes sense to respond or whether they will certainly be rejected because of their citizenship.
We decided to solve this problem, that is, to show users only those vacancies they can actually apply for, given their citizenship.
At first, we planned to solve this problem with simple rule-based methods. The basic idea was:
- Look for certain keywords in the text, for example: "only", "remote in", "authorized to work in", and so on.
- Look for a "location" next to the keywords, which, as a rule, is a capitalized word. If such a location is found, treat it as a restriction.
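The keyword idea can be sketched roughly as follows. This is only an illustration, not our production code: the function name `find_restriction`, the regular expressions, and the rule that a "location" is any run of capitalized words are assumptions made for the example.

```python
import re

# Trigger phrases taken from the examples above; a real list would be longer.
KEYWORDS = ["only", "remote in", "authorized to work in"]

# Assumption: a "location" is a run of capitalized words ("USA", "United Kingdom").
CAPITALIZED = r"\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*"


def find_restriction(text):
    """Return the capitalized phrase adjacent to a keyword, or None."""
    for keyword in KEYWORDS:
        k = re.escape(keyword)
        # Keyword followed by a location, e.g. "remote in Canada".
        after = re.search(rf"\b(?i:{k})\s+({CAPITALIZED})", text)
        if after:
            return after.group(1)
        # Location followed by a keyword, e.g. "USA only".
        before = re.search(rf"({CAPITALIZED})\s+(?i:{k})\b", text)
        if before:
            return before.group(1)
    return None


print(find_restriction("USA only"))
print(find_restriction("Living in Europe is a must."))
```

The first call finds "USA"; the second finds nothing, because no trigger phrase sits next to "Europe".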
In general, if the vacancy says "USA only", then this logic works perfectly. However, after analyzing only about 500 vacancies, it became clear that the restrictions can be indicated differently, for example:
- This role is remote, and you can be based anywhere across the UK.
- Living in Europe is a must.
- This opportunity is only open to candidates within Canada at this time.
- Location: Argentina (any part of the country it’s great for us!)
- And hundreds of other descriptions.
It became clear that rule-based algorithms could not handle the problem, so we decided to try the power of ML.
To restate the problem: the input is the text of a vacancy, which usually contains a description of the company, a technology stack, requirements, conditions, benefits, etc. The output should contain these parameters:
- restriction: 0 (no) / 1 (yes)
- if restriction = 1, the country to which the vacancy is restricted
As I wrote above, the input is a large text that usually contains a bit of everything, so the task was somewhat harder than just writing an ordinary classifier. First, we had to find what exactly to classify.
Since we were looking for location restrictions, we decided to find all the locations in the text first, then select all the sentences that contained them and write a classifier for those sentences.
We also tried to solve the problem "head-on": take a list of all countries and cities and simply search for their occurrences in the text. But again, the task was not so easy.
First, the restrictions applied not only to countries and world capitals but also to small cities and states (for example, "Can work full time in Eugene, OR / Hammond, IN"), and compiling a list of every city in the world was difficult enough.
Second, locations in vacancies were often written in nonstandard ways (for example, "100% Remote in LATAM").
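To show why the "head-on" lookup falls short, here is a small sketch. The `GAZETTEER` set and the `lookup_locations` helper are hypothetical and kept tiny for illustration; a real list would need every country, state, and city.

```python
import re

# Tiny illustrative gazetteer (assumption: lowercase names, matched as whole words).
GAZETTEER = {"usa", "canada", "united kingdom", "argentina", "london"}


def lookup_locations(text):
    """Return the gazetteer entries that occur in the text as whole words."""
    lowered = text.lower()
    return sorted(
        name
        for name in GAZETTEER
        if re.search(rf"\b{re.escape(name)}\b", lowered)
    )


print(lookup_locations("We're looking for candidates based in Montreal, Canada."))
print(lookup_locations("100% Remote in LATAM"))
print(lookup_locations("Can work full time in Eugene, OR / Hammond, IN"))
```

Only the first sentence yields a hit, and even there the city is missed. Regional abbreviations and small cities slip through unless the list keeps growing.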
Therefore, we decided to use NER (named entity recognition) to highlight locations. We tried several existing tools.
The choice fell on spaCy because its EntityRecognizer showed the best result among the ready-made, free options.
The result: we managed to highlight locations in the text.
Splitting Into Sentences
We also used spaCy to split the vacancy text into sentences and kept those that had locations inside them. The output was a list of such sentences. Here are some examples:
- The position is remote, so the only thing is they have to be in the US and work Eastern or Central time.
- This job is located out of our Chicago office, but remote, US-based applicants are still encouraged to apply.
- This is a remote role, but we're looking for candidates based in Montreal, Canada.
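A rough sketch of how such sentences can be selected with spaCy is shown below. The model name `en_core_web_sm` and the choice of the `GPE` and `LOC` entity labels are assumptions for the example, and the helper names are hypothetical.

```python
# spaCy labels treated as locations: GPE (countries, cities, states)
# and LOC (other locations, such as regions).
LOCATION_LABELS = {"GPE", "LOC"}


def sentences_with_locations(doc):
    """Return (sentence, [locations]) for each sentence that mentions a location.

    `doc` is a spaCy Doc: it exposes `.sents`, and each sentence span
    exposes `.ents` whose items have `.text` and `.label_`.
    """
    results = []
    for sent in doc.sents:
        locations = [ent.text for ent in sent.ents if ent.label_ in LOCATION_LABELS]
        if locations:
            results.append((sent.text.strip(), locations))
    return results


def analyze(text, model_name="en_core_web_sm"):
    """Run the pipeline on raw text (requires spaCy and the named model)."""
    import spacy  # imported here so the helper above works without the model installed

    nlp = spacy.load(model_name)
    return sentences_with_locations(nlp(text))
```

Calling `analyze(vacancy_text)` returns only the sentences worth sending to the classifier, together with the locations found in each.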
The model was supposed to label these sentences. One important constraint: we could not build a dataset of tens of thousands of such sentences (labeling takes a lot of time), so we had to take this limitation into account when selecting a model.
We decided to try several models, including both simpler CNN and LSTM models and more modern transformers. The transformers, predictably, turned out to be better. Training them essentially came down to fine-tuning, which suited us well because, as I said above, the dataset was not large.
Among the transformers, the RoBERTa architecture (roberta-base) showed the best result, with 94% accuracy on our dataset.
Using the classifier and NER, we obtained the following additional fields for each vacancy:
restriction: 1 (yes); location: London
The classifier gave us the restriction field, and NER gave us the location. Since the location field could contain different spellings of cities and countries, we added normalization through the Google API. We also decided to express restrictions at the country level.
So the final output looked like this:
restriction: 1 (yes); location: United Kingdom
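Putting the pieces together might look roughly like the sketch below. The `COUNTRY_BY_PLACE` table is a hypothetical stand-in for the external normalization call, and taking the first detected location is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical lookup table standing in for the external normalization service.
COUNTRY_BY_PLACE = {
    "london": "United Kingdom",
    "uk": "United Kingdom",
    "montreal": "Canada",
    "us": "United States",
}


@dataclass
class RestrictionFields:
    restriction: int                # 0 (no) / 1 (yes)
    location: Optional[str] = None  # country name when restriction == 1


def normalize_to_country(place: str) -> str:
    """Map a city or country spelling to a country; unknown places pass through."""
    cleaned = place.strip()
    return COUNTRY_BY_PLACE.get(cleaned.lower(), cleaned)


def build_fields(is_restricted: bool, locations: List[str]) -> RestrictionFields:
    """Combine the classifier verdict with the NER locations."""
    if not is_restricted:
        return RestrictionFields(restriction=0)
    # Assumption: the first detected location is the one that matters.
    country = normalize_to_country(locations[0]) if locations else None
    return RestrictionFields(restriction=1, location=country)


print(build_fields(True, ["London"]))
```

With the classifier saying "restricted" and NER returning "London", the record ends up with the country rather than the city.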
As a result, we now know how to do this, and candidates can filter out vacancies that are not suitable for them.
P.S. I didn't want to promote the aggregator here, so I'll just leave it as the reference.