Study Hours | Result |
---|---|
2 | Fail |
3 | Fail |
5 | Fail |
7 | Fail |
10 | Fail |
11 | Fail |
12 | Fail |
13 | Pass |
14 | Pass |
16 | Pass |
17 | Fail |
18 | Pass |
20 | Pass |
22 | Pass |
23 | Pass |
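The table can be written to a CSV file with a few lines of Python. This is a minimal sketch, not necessarily how the original file was made: the column names `hours` and `result`, and the encoding of Pass as 1 and Fail as 0, are assumptions inferred from the code that reads the file later. `ROWS` and `write_book_csv` are names introduced here for illustration.

```python
import csv

# (hours, result) pairs from the table above; assumed encoding: 1 = Pass, 0 = Fail.
# The column names "hours" and "result" are assumed from the code that reads the file.
ROWS = [
    (2, 0), (3, 0), (5, 0), (7, 0), (10, 0),
    (11, 0), (12, 0), (13, 1), (14, 1), (16, 1),
    (17, 0), (18, 1), (20, 1), (22, 1), (23, 1),
]


def write_book_csv(path="book.csv"):
    """Write the study-hours table to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["hours", "result"])
        writer.writerows(ROWS)


write_book_csv()
```

Storing the result as a number rather than as text lets both `np.polyfit` and scikit-learn use the column directly.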
Save the data as a `.csv` file like this. In my case, `book.csv` is the file name.

!pip install numpy
!pip install pandas
!pip install matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Read the `book.csv` file:

data = pd.read_csv("book.csv")
hours = np.array(data['hours'].values)
results = np.array(data['result'].values)
plt.scatter(hours, results, color='green')
plt.xlabel("Hours")
plt.show()
# m - slope
# b - intercept
m,b = np.polyfit(hours,results,1)
plt.xlabel("Hours")
plt.plot(hours, results, 'o', color='green')
plt.plot(hours, m*hours + b)
plt.show()
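A straight line fitted to 0/1 labels can be read as a classifier by comparing its output with 0.5. The sketch below is only an illustration. It assumes Pass is encoded as 1 and Fail as 0, and the arrays are typed in from the table so the snippet runs without `book.csv`. The names `predicted` and `crossing` are introduced here and are not part of the original code.

```python
import numpy as np

# Study hours and results from the table (assumed encoding: 1 = Pass, 0 = Fail).
hours = np.array([2, 3, 5, 7, 10, 11, 12, 13, 14, 16, 17, 18, 20, 22, 23])
results = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1])

# Least-squares line: results ≈ m * hours + b
m, b = np.polyfit(hours, results, 1)

# Classify as Pass (1) when the line's value is above 0.5, otherwise Fail (0).
predicted = (m * hours + b > 0.5).astype(int)

# Number of study hours at which the line crosses y = 0.5.
crossing = (0.5 - b) / m

print("crossing point (hours):", round(float(crossing), 2))
print("misclassified hours:", hours[predicted != results].tolist())
```

Any student whose hours fall on the wrong side of the crossing point is misclassified by this rule.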
If we look at `y=0.5`, we can see that something is wrong with the linear regression. The line is not as fair as the previous one.

If we draw the boundary at `y=0.5`, students above 0.5 (`y>0.5`) pass and students below 0.5 (`y<0.5`) fail. We can also disregard the data points that I marked in the graph below, because those occur rarely.

!pip install scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# splits the data set into training data and test data
from sklearn.model_selection import train_test_split
# logistic regression model
from sklearn.linear_model import LogisticRegression
Read the `book.csv` file using pandas:

data = pd.read_csv("book.csv")
# x_data as 2d array
x_data = data[['hours']]
y_data = data['result']
I used the `random_state=2` parameter so that the split does not change randomly between runs. In your case, you can use any number or omit it. You can also add the `test_size` parameter to change the share of the data used for the test set (default: 0.25).

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, random_state=2)
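The effect of `test_size` and `random_state` can be checked directly. This sketch assumes the same 15 rows as the table, typed in as a DataFrame so that it runs without `book.csv`, with Pass encoded as 1 and Fail as 0. The variable names ending in `5` and `_again` are introduced only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same 15 rows as the table (assumed encoding: 1 = Pass, 0 = Fail).
data = pd.DataFrame({
    "hours": [2, 3, 5, 7, 10, 11, 12, 13, 14, 16, 17, 18, 20, 22, 23],
    "result": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1],
})
x_data = data[["hours"]]
y_data = data["result"]

# Default test_size (0.25): 4 of the 15 rows go to the test set.
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, random_state=2)
print(len(x_train), len(x_test))

# An integer test_size is read as an absolute number of test rows.
x_train5, x_test5, y_train5, y_test5 = train_test_split(
    x_data, y_data, test_size=5, random_state=2
)
print(len(x_train5), len(x_test5))

# Repeating the call with the same random_state reproduces the same split.
_, x_test_again, _, _ = train_test_split(x_data, y_data, random_state=2)
print(x_test.index.equals(x_test_again.index))
```

A float `test_size` is read as a fraction of the data, and an integer as a row count.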
Using `len(x_train)` and `len(x_test)`, you can see the sizes of those data sets. In my case, `x_train` has 11 rows and `x_test` has 4.

model = LogisticRegression()
model.fit(x_train, y_train)
model.predict(x_test)
# predicted result - array([1, 0, 0, 0], dtype=int64)
Compare this with the actual values in `y_test`. For me, the result is:

11    1
4     0
5     0
0     0
Name: result, dtype: int64
model.predict([[6], [15], [19], [25]])
# predicted result - array([0, 1, 1, 1], dtype=int64)
Study Hours | Result |
---|---|
6 | Fail |
15 | Pass |
19 | Pass |
25 | Pass |
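`model.predict` returns only the 0/1 labels. The sketch below uses scikit-learn's `predict_proba` to show the estimated pass probability behind each label, which ties back to the 0.5 cutoff discussed earlier. It assumes the table's data with Pass encoded as 1 and Fail as 0, typed in so that it runs without `book.csv`. The names `new_hours`, `pass_probability` and `labels` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Data typed in from the table (assumed encoding: 1 = Pass, 0 = Fail).
data = pd.DataFrame({
    "hours": [2, 3, 5, 7, 10, 11, 12, 13, 14, 16, 17, 18, 20, 22, 23],
    "result": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1],
})
x_train, x_test, y_train, y_test = train_test_split(
    data[["hours"]], data["result"], random_state=2
)
model = LogisticRegression()
model.fit(x_train, y_train)

# New study hours to classify, as a DataFrame so the column name matches training.
new_hours = pd.DataFrame({"hours": [6, 15, 19, 25]})

# Column 1 of predict_proba is the estimated probability of class 1 (Pass).
pass_probability = model.predict_proba(new_hours)[:, 1]
labels = model.predict(new_hours)

for h, p, label in zip(new_hours["hours"], pass_probability, labels):
    print(f"{h} hours: P(pass) = {p:.2f} -> {'Pass' if label == 1 else 'Fail'}")
```

A label of 1 corresponds to a pass probability above 0.5, and a label of 0 to a probability below it.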
model.score(x_test, y_test)
# 1.0
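For a classifier, `model.score` returns the mean accuracy: the fraction of test rows whose predicted label matches the true label. A minimal sketch of that calculation, reusing the predicted and actual values reported above (`predicted`, `actual` and `accuracy` are illustrative names):

```python
# Values reported above: predictions for x_test and the true labels in y_test.
predicted = [1, 0, 0, 0]
actual = [1, 0, 0, 0]

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(accuracy)  # 1.0
```

With only four test rows, the score can only be 0, 0.25, 0.5, 0.75 or 1.0, so it is a coarse estimate.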
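!pip install seaborn
# regplot with logistic=True also requires statsmodels
!pip install statsmodels

Now plot it with `regplot`, using the `book.csv` data.

import seaborn as sns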
sns.regplot(x='hours', y='result', data=data, logistic=True)