A Navid Mashinchi Guide For Utilising Spark ML and Spark Streaming
Navid Mashinchi has explained a number of complex concepts to his followers and students around the globe. One example is his exploratory tutorial on using Spark ML to make predictions on streaming data. The tutorial assumes prior familiarity with Spark from the outset.
His example revolves around the notion of predicting whether someone will have a heart attack based on age, gender, and medical conditions.
Data Collection
The data used in the Navid Mashinchi example was found on Kaggle and has 303 rows and 14 columns, with each row holding the information for an individual patient. In his example, he creates a schema to ensure that each column is read from the file correctly.
The raw data is then loaded into Spark, which collates and organises it according to the instructions given, and the code is adjusted to ensure that the resulting dataset is readable and correct.
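As a rough sketch of what reading the file against an explicit schema might look like in PySpark: the column names and types below are assumed from the commonly shared Kaggle heart-disease CSV and are not taken from the tutorial itself, and the helper name read_heart_csv is hypothetical.

```python
# Assumed column layout: 13 features plus a binary "target" label.
# These names and types are NOT given in the article; adjust them to the real file.
COLUMNS = [
    ("age", "INT"), ("sex", "INT"), ("cp", "INT"), ("trestbps", "INT"),
    ("chol", "INT"), ("fbs", "INT"), ("restecg", "INT"), ("thalach", "INT"),
    ("exang", "INT"), ("oldpeak", "DOUBLE"), ("slope", "INT"), ("ca", "INT"),
    ("thal", "INT"), ("target", "INT"),
]

# Spark accepts a DDL-formatted string wherever a schema is expected.
SCHEMA_DDL = ", ".join(f"{name} {sql_type}" for name, sql_type in COLUMNS)


def read_heart_csv(spark, path):
    """Read the CSV with an explicit schema instead of letting Spark infer one."""
    return spark.read.csv(path, header=True, schema=SCHEMA_DDL)
```

Given an existing SparkSession named spark, a call such as read_heart_csv(spark, "heart.csv") would return the DataFrame; the file name here is only a placeholder.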
Machine Learning
Navid Mashinchi then explains the data processing steps and how he splits the data 70/30, with 70% used for training and 30% for testing.
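A 70/30 split of this kind could be sketched as follows. The function name and the seed value are illustrative assumptions; the tutorial's exact call is not reproduced here.

```python
def split_train_test(df, train_fraction=0.7, seed=42):
    """Split a Spark DataFrame into (train, test) by random row assignment."""
    # randomSplit assigns each row at random, so the resulting sizes only
    # approximate 70/30 rather than matching it exactly.
    train_df, test_df = df.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )
    return train_df, test_df
```

Fixing the seed keeps the split repeatable between runs, which matters when the test portion is later reused as the simulated stream.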
He goes on to explain his reasoning and the need for a pipeline, breaking it down into five easily digestible stages: a vector assembler, a scaling step, one-hot encoding, a second vector assembler, and logistic regression. The underlying processes are fully explained and helpfully mapped out.
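The five stages could be wired together along these lines. This is a minimal sketch under several assumptions: the grouping of columns into numeric and categorical, the intermediate column names, and the choice of StandardScaler for the scaling step are all guesses, since the summary does not specify them.

```python
# Assumed grouping of the 13 feature columns; not taken from the tutorial.
NUMERIC_COLS = ["age", "trestbps", "chol", "thalach", "oldpeak"]
CATEGORICAL_COLS = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]


def final_feature_inputs():
    """Columns fed to the second assembler: scaled numerics plus encoded categoricals."""
    return ["numeric_scaled"] + [c + "_ohe" for c in CATEGORICAL_COLS]


def build_pipeline(label_col="target"):
    """Assemble the five stages; needs PySpark 3.x and an active SparkSession."""
    # Imported here so the column definitions above can be used without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import OneHotEncoder, StandardScaler, VectorAssembler

    # Stage 1: gather the numeric columns into a single vector column
    numeric_assembler = VectorAssembler(inputCols=NUMERIC_COLS, outputCol="numeric_vec")
    # Stage 2: scale that vector (StandardScaler is an assumed choice)
    scaler = StandardScaler(inputCol="numeric_vec", outputCol="numeric_scaled")
    # Stage 3: one-hot encode the integer-coded categorical columns
    encoder = OneHotEncoder(
        inputCols=CATEGORICAL_COLS,
        outputCols=[c + "_ohe" for c in CATEGORICAL_COLS],
    )
    # Stage 4: combine scaled and encoded columns into one feature vector
    final_assembler = VectorAssembler(
        inputCols=final_feature_inputs(), outputCol="features"
    )
    # Stage 5: logistic regression on the binary label
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[numeric_assembler, scaler, encoder, final_assembler, lr])
```

Calling build_pipeline().fit(train_df) on the training set would then produce a fitted model whose transform method adds a prediction column to any DataFrame with the same input columns.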
Logistic regression is explained as a suitable choice because the target is binary (1s and 0s). The tutorial then fits the pipeline on the initial training set, and Navid Mashinchi helpfully includes a range of detailed images so that the construction of the model is easy to follow.
After exploring the resulting prediction scores, he arrives at an accuracy score for the model.
Streaming
Now comes the fun part: the incorporation of Spark Streaming into the equation. Navid Mashinchi repartitions the initial test data into separate files to simulate a stream. After creating a source, Mr. Mashinchi applies the schema from the beginning of the process when reading in the files, along with further measures to streamline the results.
After he walks through the final stages of preparation and sets up the streaming of the test data, it all comes together nicely. The result is a stream of unseen data, drawn from the repartitioned test data for replication purposes.
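The streaming setup described above might be sketched as follows. The function names, the number of files, and the maxFilesPerTrigger option are assumptions made for illustration; the summary does not say which additional measures the tutorial uses.

```python
def write_stream_source(test_df, out_dir, num_files=10):
    """Write the test set as several CSV files so they can be replayed as a stream."""
    (
        test_df.repartition(num_files)  # one output file per partition
        .write.mode("overwrite")
        .option("header", True)
        .csv(out_dir)
    )


def stream_predictions(spark, model, source_dir, schema):
    """Read the files as a stream and score each micro-batch with the fitted model."""
    stream_df = (
        spark.readStream.schema(schema)  # file streams need an explicit schema
        .option("header", True)
        .option("maxFilesPerTrigger", 1)  # assumed: feed one file per micro-batch
        .csv(source_dir)
    )
    # The result is a streaming DataFrame that includes the model's prediction column
    return model.transform(stream_df)
```

The returned streaming DataFrame does nothing until a sink is attached, for example with writeStream.format("console").start(), after which each file arriving in the source directory is scored as a new batch.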
Navid Mashinchi then explains how the resulting data can be interpreted using the tools he outlined throughout the tutorial.
Final Thoughts
Navid Mashinchi shows how concepts can be explained through practice rather than by throwing jargon across the page. The point of the tutorial was not to showcase the accuracy of the model, but rather the process of making predictions on unseen streamed data.