STA 142A Statistical Learning I

Discussion 1: Introduction to Python Programming

TA: Tesi Xiao

Course Overview

  • Python is the only programming language used in this class. If you are unfamiliar with the basic syntax (such as the control flow tools), please go through Sections 1-6 of the Python tutorial during the first few weeks. Some familiarity with object-oriented programming (OOP) is also helpful. If you want to learn OOP in Python, take time to read Section 9: Classes.

  • In addition to the theory part, the course will involve basic Python programming, including:
    • Data Manipulation: Numpy, Pandas
    • Machine Learning tools: scikit-learn
    • Visualization: matplotlib
  • The main focus of this course is to provide motivated students with a conceptual and practical understanding of statistical learning techniques. This is not a machine learning or deep learning course, and we will not go deep into technical details.

Python Environment Setup

For this class, we strongly recommend installing Anaconda to set up your programming environment.

Several tools you may use for programming:

  • Jupyter Notebook: a great platform to organize your project with Markdown and Python. We will use this to submit homework assignments.
  • PyCharm: a great IDE for Python.

(Advanced) Text Editors you can use for coding:

Markdown Syntax: see the Markdown Cheatsheet.

Python Basic Syntax

Please quickly review all the commands listed below.

  • Operators: +, -, *, **, /, //, =, ==, is, in, …
  • Control flow tools: if...elif...else..., for, while, break, continue, pass, …
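As a quick refresher on the operators and control flow tools listed above, here is a minimal sketch. The variable names and the numbers used are illustrative assumptions, not part of the course material.

```python
# Arithmetic operators
true_div = 7 / 2      # true division always returns a float
floor_div = 7 // 2    # floor division rounds down
power = 2 ** 3        # exponentiation

# == compares values, while `is` checks whether two names refer to the same object
x = [1, 2]
y = [1, 2]
same_value = (x == y)     # equal contents
same_object = (x is y)    # two separate list objects

# `in` tests membership
has_two = 2 in x

# Control flow: label each number, skipping 0 and stopping at the first value above 3
labels = []
for n in [0, 1, 2, 3, 4, 5]:
    if n == 0:
        continue          # skip to the next iteration
    elif n > 3:
        break             # leave the loop entirely
    elif n % 2 == 0:
        labels.append("even")
    else:
        labels.append("odd")

print(true_div, floor_div, power)        # 3.5 3 8
print(same_value, same_object, has_two)  # True False True
print(labels)                            # ['odd', 'even', 'odd']
```

Note that `=` assigns a name, whereas `==` tests equality; mixing them up is a common source of errors for newcomers.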

Python Basic Data Structures

To warm up, we summarize several basic data structures in Python. Below are four commonly-used structures.

1. Lists

# list
a = []
print("Empty: "+str(a))

a.append('a')
a.append('c')
print("After appending: "+str(a))

a.insert(1,'b')
print("After inserting: "+str(a))

a.pop()
print("After popping: "+str(a))

# .... other methods for Python basic lists (sort, remove, ...)

# index starts from 0 (unlike R)
print("a[0] = "+str(a[0]))
print("a[1] = "+str(a[1]))
Empty: []
After appending: ['a', 'c']
After inserting: ['a', 'b', 'c']
After popping: ['a', 'b']
a[0] = a
a[1] = b
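The list methods mentioned in the comment above (sort, remove) and basic slicing can be sketched as follows. The list `nums` is a made-up example.

```python
nums = [3, 1, 2, 1]

nums.remove(1)        # removes only the first occurrence of the value 1
print(nums)           # [3, 2, 1]

nums.sort()           # sorts in place and returns None
print(nums)           # [1, 2, 3]

first_two = nums[:2]  # slicing: indices 0 and 1 (the end index is excluded)
last = nums[-1]       # negative indices count from the end
print(first_two, last)  # [1, 2] 3
```

Because `sort()` modifies the list in place, use `sorted(nums)` instead when you want a new sorted list and need to keep the original order.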

2. Tuples: immutable sequences that are typically faster and consume less memory than lists

b = (1,2,2,3)

# count the number of occurrences of a value
b.count(2)
2
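To illustrate what immutability means in practice, the sketch below tries to modify a tuple and catches the resulting error. It also prints the sizes reported by `sys.getsizeof` for a tuple and a list with the same elements; the exact numbers depend on the Python implementation and version, so treat them as indicative only.

```python
import sys

t = (1, 2, 2, 3)

# Item assignment is not allowed on a tuple
try:
    t[0] = 10
except TypeError as err:
    error_message = str(err)
    print("Cannot modify a tuple:", error_message)

# Tuples support unpacking into separate names
first, *rest = t
print(first, rest)

# Compare the container sizes in bytes (implementation dependent)
lst = [1, 2, 2, 3]
print(sys.getsizeof(t), sys.getsizeof(lst))
```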

3. Dictionaries (Hash Maps)

# dictionaries contain a mapping from keys to values (fast)
d = {'first':'string value', 'second':[1,2]}

print("keys:"+str(d.keys()))
print("values:"+str(d.values()))
keys:dict_keys(['first', 'second'])
values:dict_values(['string value', [1, 2]])
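Beyond listing keys and values, the most common dictionary operations are lookup, insertion, and iteration. A minimal sketch, reusing the dictionary above and adding a hypothetical key 'third':

```python
d = {'first': 'string value', 'second': [1, 2]}

first_value = d['first']              # look up a value by key
d['third'] = 3.0                      # add a new key-value pair
has_third = 'third' in d              # membership tests check the keys
fallback = d.get('missing', 'n/a')    # get() returns a default instead of raising KeyError

print(first_value, has_third, fallback)

# Iterate over key-value pairs (insertion order is preserved in Python 3.7+)
for key, value in d.items():
    print(key, '->', value)
```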

4. Sets

a = set([1, 2, 3, 4])
b = set([3, 4, 5, 6])

print("a | b: "+ str(a | b)) # Union
print("a & b: "+ str(a & b)) # Intersection
a | b: {1, 2, 3, 4, 5, 6}
a & b: {3, 4}
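Sets also support difference and symmetric difference, and they are a convenient way to remove duplicates. A short sketch using the same two sets; the results are sorted before printing because sets have no guaranteed order.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

only_in_a = a - b        # difference: elements of a that are not in b
in_exactly_one = a ^ b   # symmetric difference: elements in a or b, but not both
unique = set([1, 1, 2, 3, 3])  # duplicates are dropped

print(sorted(only_in_a))       # [1, 2]
print(sorted(in_exactly_one))  # [1, 2, 5, 6]
print(sorted(unique))          # [1, 2, 3]
print(3 in a)                  # True
```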

Advanced Data Structures from Packages

For example,

Module and Package

Python is open. Python is developed under an OSI-approved open source license, making it freely usable and distributable, even for commercial use. Everyone can contribute to this community, such as developing useful modules.

It is much more convenient to manage your programming environment using conda. Read the conda user guide.

import math

math.sqrt(2)

import math as m

m.sqrt(2)
1.4142135623730951

from math import sqrt

# from math import *  # imports every name from math; works, but is best avoided

sqrt(2)
1.4142135623730951

Below are some useful modules you will use in this class.

  • numpy: The fundamental package for scientific computing with Python
  • scikit-learn: Machine Learning in Python
  • matplotlib: Visualization with Python
  • seaborn: Statistical data visualization
  • statsmodels: statistical models, hypothesis tests, and data exploration
  • PyTorch: An open source machine learning framework that accelerates the path from research prototyping to production deployment

Function

You can define your own function or call functions from modules.

# define your own functions
def myfunction(a):
    b = a
    return b

# Functions from modules (a module must be imported before it is used)
import math
import numpy as np

math.sqrt(3)

np.array([1,2,3])
array([1, 2, 3])
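A slightly more realistic function usually has a docstring and may take default arguments. The sketch below uses a hypothetical function `euclidean_norm` (not from any library) to show both.

```python
import math

def euclidean_norm(values, scale=1.0):
    """Return scale times the Euclidean norm of a sequence of numbers.

    `scale` is an optional argument; it defaults to 1.0 when omitted.
    """
    return scale * math.sqrt(sum(v ** 2 for v in values))

print(euclidean_norm([3, 4]))           # 5.0
print(euclidean_norm([3, 4], scale=2))  # 10.0
```

Calling `help(euclidean_norm)` prints the docstring, which is the same mechanism used by the documentation of library functions.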

Class, Object and its methods (OOP)

Object-oriented programming (OOP) is a programming paradigm based on the concept of “objects”, which can contain data and code: data in the form of fields (often known as attributes or properties), and code, in the form of procedures (often known as methods).

import numpy as np
# Create a np.array object
ar = np.array([1,2,3])

# Check all the attributes and methods of this object
dir(ar)
['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__xor__',
 'all',
 'any',
 'argmax',
 'argmin',
 'argpartition',
 'argsort',
 'astype',
 'base',
 'byteswap',
 'choose',
 'clip',
 'compress',
 'conj',
 'conjugate',
 'copy',
 'ctypes',
 'cumprod',
 'cumsum',
 'data',
 'diagonal',
 'dot',
 'dtype',
 'dump',
 'dumps',
 'fill',
 'flags',
 'flat',
 'flatten',
 'getfield',
 'imag',
 'item',
 'itemset',
 'itemsize',
 'max',
 'mean',
 'min',
 'nbytes',
 'ndim',
 'newbyteorder',
 'nonzero',
 'partition',
 'prod',
 'ptp',
 'put',
 'ravel',
 'real',
 'repeat',
 'reshape',
 'resize',
 'round',
 'searchsorted',
 'setfield',
 'setflags',
 'shape',
 'size',
 'sort',
 'squeeze',
 'std',
 'strides',
 'sum',
 'swapaxes',
 'take',
 'tobytes',
 'tofile',
 'tolist',
 'tostring',
 'trace',
 'transpose',
 'var',
 'view']

Also, check out the documentation for detailed instructions on all these methods.

# Call the min() method
ar.min() # ar.max()
1
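The same ideas apply to classes you define yourself. As a minimal sketch, the hypothetical class `RunningMean` below stores data in attributes (`count`, `total`) and exposes behavior through methods (`update`, `mean`).

```python
class RunningMean:
    """Keep track of the mean of the numbers seen so far."""

    def __init__(self):
        # Attributes: the data stored on each object
        self.count = 0
        self.total = 0.0

    def update(self, x):
        # Method: code that modifies the object's data
        self.count += 1
        self.total += x

    def mean(self):
        if self.count == 0:
            raise ValueError("no values have been added yet")
        return self.total / self.count

# Create an object (an instance of the class) and call its methods
rm = RunningMean()
for x in [1, 2, 3]:
    rm.update(x)

print(rm.count, rm.mean())  # 3 2.0
```

Running `dir(rm)` lists `count`, `total`, `update`, and `mean` alongside the built-in double-underscore names, just as `dir(ar)` does for a NumPy array.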