
Everything You Need to Know About Scikit-Learn Python Library


Take a look at the Scikit-learn Python library, including its implementation, training a model, and some additional tips.


If you have just taken your first step into the data science industry and are learning the Python programming language, then, as a Pythonista, you should be aware of the Scikit-learn library. And if you are seriously considering bringing data science and machine learning into a production system, you should know the Scikit-learn Python library thoroughly. In this article, let us explore the Scikit-learn Python library and learn about different aspects of its use.

Background of Scikit-learn

Scikit-learn is also known as scikits.learn (its former name) or sklearn. It is a free, open-source machine learning library for the Python programming language. The library was started by David Cournapeau as a Google Summer of Code project in 2007, and Matthieu Brucher later joined the project in 2010. The library was first made public in February 2010, and by November 2012 it had become one of the most popular machine learning libraries on GitHub. The primary features of the Scikit-learn library include classification, regression, and clustering algorithms (support vector machines, random forests, gradient boosting, k-means, and DBSCAN). Sklearn is designed to interoperate with Python's numerical and scientific libraries, such as NumPy and SciPy.

Scikit-learn Implementation

Sklearn is written largely in the Python programming language, and NumPy is used extensively for high-performance linear algebra and array operations. Some of the core algorithms are written in Cython to improve performance.
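As a quick, minimal sketch of that interplay (not taken from the article; the data values are made up), a scikit-learn estimator can be fit directly on a NumPy array:

Python

import numpy as np
from sklearn.cluster import KMeans

# Four 2-D points held in a NumPy array; scikit-learn consumes it directly.
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])

# Group the points into two clusters with k-means.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)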

Scikit-learn Integration

In the Python programming language, the Scikit-learn library can be integrated with other Python libraries to accomplish different tasks:

  • NumPy - Base n-dimensional array package (array vectorization)
  • SciPy - Fundamental library for scientific computing
  • Matplotlib and Plotly - Comprehensive 2D/3D plotting
  • IPython - An enhanced interactive console
  • Sympy - Symbolic mathematics
  • Pandas - Data structures and analysis

Separately developed extensions to SciPy are known as SciKits, and because this module provides learning algorithms, it was named Scikit-learn.

Highlights of Scikit-learn

  • Clustering is used for grouping unlabeled data, e.g. KMeans.

  • Cross-validation is used for estimating the performance of supervised models on unseen data (a short sketch combining cross-validation with parameter tuning appears after this list).

  • Datasets are used for test datasets and for the generation of datasets with particular properties for the investigation of model behavior.

  • Dimensionality Reduction is used for reducing the number of attributes in data for summarization, visualization and feature selection such as principal component analysis.

  • Ensemble methods are used for combining the predictions of multiple supervised models.

  • Feature extraction is used for defining attributes in image and text data.

  • Feature selection is used for identifying meaningful attributes from which supervised models are created.

  • Parameter Tuning is used for getting the most out of supervised models.

  • Manifold learning is used for summarizing and depicting complex multi-dimensional data.

  • Supervised models form a vast array that includes, but is not limited to, generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines, and decision trees.
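Here is a rough, assumed sketch (not part of the original article) showing how two of these highlights look in practice, using the bundled iris dataset and a support vector classifier purely for illustration:

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation: estimate performance on unseen data with 5 folds.
print(cross_val_score(SVC(), X, y, cv=5).mean())

# Parameter tuning: search a small grid of SVC hyperparameters.
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, cv=5)
grid.fit(X, y)
print(grid.best_params_)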

Who Are the Users of Scikit-learn Python Library?

Scikit-learn is used by organizations such as Inria, Mendeley, wise.io, Evernote, Telecom Paris Tech, and AWeber, along with many other large firms.

Documentation of Scikit-learn Python Library

Example: Sentiment Analysis with Scikit-learn

Text classification is performed with the help of NLP, and it has an assortment of applications such as detecting user sentiment, classifying email as spam or ham, automatically tagging blog posts into categories, and so on. Follow the eight steps below to create a text classification model in Python:

Importing Libraries

First, execute the following script to import the required libraries.

Python

import numpy as np
import re
import nltk
import pickle
from sklearn.datasets import load_files
from nltk.corpus import stopwords

nltk.download('stopwords')



Importing The Dataset

To import the dataset into our application, we'll use the load_files function from the sklearn.datasets library. This function automatically divides the dataset into data and target sets. In our example, we'll pass load_files the path to the "txt_sentoken" directory. load_files treats each folder inside the "txt_sentoken" folder as one category, and all documents inside that folder are assigned the corresponding category. Let's execute the following code:

Python

movie_data = load_files(r"E:\txt_sentoken")
X, y = movie_data.data, movie_data.target



The load_files call in the code above loads the data from both the 'neg' and 'pos' folders into the X variable and stores the corresponding target categories in y. Here, X is a list of 1,000 string elements, while y is a NumPy array of size 1,000. If you print y to the screen, you will see an output of 1s and 0s: the load_files function maps each category folder to a number in the target array, so documents from the 'neg' folder are labeled 0 and documents from the 'pos' folder are labeled 1.
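If you want to verify what load_files produced (a quick check added here for illustration; it is not part of the original snippet), you can inspect the returned object:

Python

# Inspect what load_files returned.
print(len(X))                   # number of documents loaded
print(movie_data.target_names)  # folder names used as category labels
print(y[:10])                   # the first few numeric labels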

Text Preprocessing

After you have imported the dataset, it is time to preprocess the text. Raw text may contain numbers, special characters, extra blank spaces, and so on, and depending on the requirements we can add, remove, or edit parts of it. In the code snippet below, we clean up the text using regular expressions from Python's re library.

Python

documents = []

from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

for sen in range(0, len(X)):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))

    # Remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove single characters from the start
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substitute multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Remove the prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatization
    document = document.split()
    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)

    documents.append(document)



The last preprocessing step is lemmatization, which reduces a word to its dictionary root form; for example, 'rats' is converted to 'rat'. Lemmatization helps us avoid creating separate features for words that are semantically the same but syntactically different.
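To see lemmatization in isolation, here is a tiny assumed example (not from the article); note that WordNetLemmatizer needs the WordNet corpus, which may have to be downloaded first:

Python

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data is required by the lemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('rats'))  # -> 'rat'
print(lemmatizer.lemmatize('feet'))  # -> 'foot'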

Conversion of Text into Numbers

Machines do not understand raw text the way humans do; they ultimately work only with numbers. Since we are dealing with machine learning, we need to convert the text into numbers so that the machine can process it. Two common approaches can be used here: the bag-of-words model and the word embedding model.
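Note that the TfidfTransformer snippet further down assumes that a matrix X of bag-of-words counts already exists, a step the article does not show. A minimal, assumed sketch of that step with scikit-learn's CountVectorizer (parameters chosen for illustration) could look like this:

Python

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Turn the preprocessed documents from the previous step into a
# matrix of token counts (the bag-of-words representation).
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7,
                             stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()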

Let us use the TF-IDF approach, where TF stands for Term Frequency and IDF for Inverse Document Frequency.

Calculation of Term Frequency (TF) is done by:

TF(word) = (Number of occurrences of the word) / (Total number of words in the document)

Calculation of Inverse Document Frequency (IDF) is done by:

IDF(word) = log((Total number of documents) / (Number of documents containing the word))


This means that the TF-IDF value of a word in a specific document is high if the word occurs frequently in that document but rarely in the rest of the documents.
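As a quick worked example (using a base-10 logarithm, since the article does not specify one): if a word occurs 5 times in a 100-word document, its TF is 5/100 = 0.05; if it appears in 10 out of 1,000 documents, its IDF is log(1000/10) = 2, giving a TF-IDF value of 0.05 × 2 = 0.1.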

The following code converts the values obtained with the bag-of-words model into TF-IDF values.

Python

from sklearn.feature_extraction.text import TfidfTransformer

tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()



Or, in case you want to convert text documents directly into TF-IDF feature values, use the following code snippet:

Python

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7,
                                 stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(documents).toarray()



Training and Test Sets

For any supervised machine learning problem, you need to understand training and test sets. Use the train_test_split utility from the sklearn.model_selection library. The code snippet below divides the data into an 80% training set and a 20% test set.

Python

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)



Training Text Classification Model and Predicting Sentiment

After dividing the data into training and test sets, use the Random Forest algorithm to perform the next step. You can also use any algorithm of your choice. In the code snippet below, we implement the Random Forest algorithm.

Python

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)



So, you have now trained your text classification model and can use it to make predictions.
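The evaluation snippet in the next step relies on a y_pred variable that the article never defines; presumably it holds the classifier's predictions on the test set, obtained roughly like this:

Python

# Assumed step: predict sentiment labels for the held-out test documents.
y_pred = classifier.predict(X_test)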

Evaluating The Model

A classification model is evaluated using metrics. We'll use the classification_report, confusion_matrix, and accuracy_score utilities from the sklearn.metrics library.

Python

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))



Saving and Loading the Model

The code above is only a small snippet of the full example. In real-world scenarios there can be a very large number of documents, and you may have spent days or months training the algorithm, so you should save the model once it is trained.
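A minimal sketch of doing so with the pickle module imported at the start of the example (the file name text_classifier.pkl is my own choice, not taken from the article):

Python

# Save the trained classifier to disk...
with open('text_classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f)

# ...and load it back later without retraining.
with open('text_classifier.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

print(loaded_model.score(X_test, y_test))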

Tips to Finalize Your Model

Keep a few important considerations in mind when finalizing machine learning models built with the scikit-learn library: the Python version, the library versions, and manual serialization.
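For instance (an assumed sketch, not from the article), you can record the exact Python and library versions alongside the saved model so it can later be restored in a matching environment:

Python

import sys
import numpy
import sklearn

# Record the environment in which the model was trained.
print("Python:", sys.version)
print("NumPy:", numpy.__version__)
print("scikit-learn:", sklearn.__version__)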

Do you have any questions about Python, data science, or machine learning? Share your queries in the comments section below; I would love to answer them. I hope you are clear on the topic now, and I will soon come up with a few more blogs on Python. Until then, you can find Python blog posts at codegnan.


