Research Update: Week of December 15 — Python
After one week of research and trial and error, I have some more progress to report. I started the week by installing the packages used in the books that I am reading. I wrote a simple shell script to install them all:
#!/bin/bash
apt-get update -y
apt-get upgrade -y
apt-get install python-pip python3-pip python-bs4 python3-bs4
apt install apt-transport-https software-properties-common
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'
apt-get update -y
apt-get install r-base
apt install gdebi-core
wget https://download1.rstudio.org/desktop/bionic/amd64/rstudio-1.2.5019-amd64.deb
gdebi rstudio-1.2.5019-amd64.deb
pip install -r requirements.txt
pip3 install -r requirements.txt
echo "Installation Complete!"
The requirements.txt file lists the Python packages to install. Here is that file:
requests
jupyterlab
numpy
tensorflow
scikit-learn
nltk
pandas
matplotlib
re
bs4
Since the Machine Learning and Natural Language Processing books are longer, with longer chapters, I am currently pacing myself to complete one section per week. This allows me to ingest more material across multiple disciplines while continuing to make progress with my book, write other blogs, and hold down my day job.
Python Learning
Natural Language Processing
Starting with the other NLP (Natural Language Processing, not Neuro-Linguistic Programming), I installed the Natural Language Toolkit (nltk). At first, I was running the commands from the book in the python3 interpreter; then I decided to write some scripts and save them as files.
At the beginning of this book (Natural Language Processing with Python), I am spending a lot of time using the sample dataset from the nltk, which can be loaded via:
from nltk.book import *
This provides nine different literary works to analyze. The book starts with basic math and arithmetic. Next, we start analyzing the words within the texts (named text1 through text9). We begin by creating a concordance, which shows every occurrence of a given word along with its surrounding context. I can see where this will be important in later projects because of the context it provides. To get a concordance of a text:
<filename>.concordance("<word_searched_for>")
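As a minimal sketch (assuming the nltk sample data has already been downloaded with nltk.download()), the classic example from the nltk documentation looks like this:

from nltk.book import text1   # text1 is Moby Dick in the sample data

# Print every occurrence of "monstrous" with its surrounding context
text1.concordance("monstrous")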
We can also do a similar search with the word similar instead of concordance. This finds words that appear in contexts like the one searched for. I ran this with the word “gazed,” and it returned words such as “stared” and “discovered.”
Using common_contexts, we can see the contexts that two words share.
<filename>.common_contexts(["<word1>", "<word2>"])
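Here is a rough sketch of both; the word pair “monstrous”/“very” is just the pair used in the nltk documentation, and the output depends on the text:

from nltk.book import text2   # text2 is Sense and Sensibility in the sample data

text2.similar("monstrous")                      # words used in contexts similar to "monstrous"
text2.common_contexts(["monstrous", "very"])    # contexts the two words share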
Here is where things could get scary. The generate function. It will analyze the text input and create a passage of its own using ngrams. Imagine the possibilities for this if using it for disinformation, deception, or deep fakes.
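A quick sketch of what that looks like (note that generate() was removed in some nltk 3.x releases and reintroduced in nltk 3.4, so whether this runs depends on the installed version):

from nltk.book import text3   # text3 is the Book of Genesis in the sample data

# Builds an n-gram language model over the text and prints generated tokens
text3.generate()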
Next, we work with the len() function to determine the length of our text. We then move to sorting with sorted() and counting occurrences of words, which we can accomplish via:
<filename>.count("<word>")
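Putting those together, here is a small sketch (the word choice is arbitrary):

from nltk.book import text1

len(text1)                # total number of tokens in the text
len(set(text1))           # size of the vocabulary (distinct tokens)
sorted(set(text1))[:20]   # the first 20 vocabulary items in sorted order
text1.count("whale")      # how many times "whale" occurs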
Next, we get into lists: creating them, adding two lists together, and appending to them. We move into indexing lists via:
<filename>.index("<word>") Returns the index of the first occurrence of <word>.
<filename>[8675309:] Returns all items in the list from index 8675309 onward (note the list starts at 0).
<filename>[:8675309] Returns all items in the list before (not including) index 8675309.
<filename>[867:5309] Returns items 867 through 5308.
<filename>[6] The item at index 6 in the list.
Modifying Lists:
<list>[6] = 'six' Replaces the item at index 6 with 'six'.
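To tie the list operations above together, here is a minimal sketch with a made-up list of tokens:

sent = ['The', 'quick', 'brown', 'fox']

combined = sent + ['jumps', 'over']   # adding two lists together
combined.append('the')                # appending a single item in place

combined.index('fox')                 # index of the first occurrence -> 3
combined[2:]                          # everything from index 2 onward
combined[:2]                          # everything before index 2
combined[1:4]                         # items at indexes 1, 2, and 3
combined[0] = 'A'                     # replacing the item at index 0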
Next up is variables, then strings. We start working with frequency distributions (this could come in handy with password cracking, BTW). This works through the frequency of the words, the length of the words, and a combination of the two. This is accomplished using the FreqDist class within nltk.
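A short sketch of how that looks (the word and the thresholds below are arbitrary choices, not anything from the book):

from nltk import FreqDist
from nltk.book import text1

fdist = FreqDist(text1)        # frequency of every token in the text
fdist.most_common(10)          # the ten most frequent tokens
fdist['whale']                 # count for a single word
fdist.hapaxes()[:10]           # a few words that occur only once

# Combining word length and frequency
frequent_long = [w for w in set(text1) if len(w) > 7 and fdist[w] > 7]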
The next section has me working with Python logic and conditionals.
<  Less Than
<= Less Than or Equal To
== Equal To
!= Not Equal To
>  Greater Than
>= Greater Than or Equal To
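For instance, these comparisons can filter a text by word length (the thresholds here are just examples):

from nltk.book import text1

vocab = set(text1)
short_words = [w for w in vocab if len(w) <= 3]    # three letters or fewer
four_letter = [w for w in vocab if len(w) == 4]    # exactly four letters
long_words  = [w for w in vocab if len(w) >= 15]   # fifteen letters or more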
Conditional patterns of use are:
<name>.startswith(<letter>) Checks if <name> starts with <letter>
<name>.endswith(<letter>) Checks if <name> ends with <letter>
<letter> in <name> Checks if <letter> appears in <name>
<name>.islower() Checks if all characters are lowercase
<name>.isupper() Checks if all characters are uppercase
<name>.isalpha() Checks if all characters are alphabetic
<name>.isalnum() Checks if all characters are alphanumeric
<name>.isdigit() Checks if all characters are digits (numbers)
<name>.istitle() Checks if all characters are in Title Case (all initial capitals)
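A couple of these patterns in action, as a rough sketch (the letter “Q” and the sample word are arbitrary):

from nltk.book import text1

# Title-cased vocabulary items that start with "Q"
sorted(w for w in set(text1) if w.istitle() and w.startswith('Q'))

'Whale'.isalpha()   # True
'Whale'.islower()   # False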
Here is an example where I input a chapter of my upcoming book into the interpreter and looked for words ending in “ishing”:
>>> sorted([w for w in set(book) if w.endswith('ishing')])
['Phishing', 'accomplishing', 'phishing', 'vishing']
This could be useful in determining variations that people use on passwords.
Machine Learning (with Scikit-Learn and TensorFlow)
Much of the first chapter of the book (Hands-On Machine Learning with Scikit-Learn and TensorFlow) is an introduction to how Machine Learning works. This takes us through supervised and unsupervised learning, data quality, underfitting versus overfitting, instance-based versus model-based learning, and batch versus online learning, to name a few.
I didn’t really write any code or learn anything relevant to Python in this chapter. To me, this was a solid refresher on what I learned while working on a graduate certificate. I was able to pick up where I left off with aspects of cluster analysis and regression, which was a relief in a sense.
Conclusion
Next week is a busy week. In addition to my day job, I will be on an ITSP Magazine podcast on Monday and Paul’s Security Weekly on Thursday as part of a penetration testing panel. I hope to meet my self-imposed quotas, but we will see where I land. I will check back in next week.
AT&T Cybersecurity (formerly AlienVault) published my article Which Security Certification Is Right For You (if any) this week. I have pieces forthcoming from TripWire and ITSP, both of which have a Home Alone theme. Stay tuned for those. I will also be writing something for work, so watch out for that as well.