Recitation for BigData
Jay Gu, Jan 10
HW1 Preview and Java Review
Outline
- HW1 preview
- Review of Java basics
- An example of gradient descent for linear regression in Java
HW1 Preview
On ~1 million size data:
- Warm-up exercise
- Stochastic gradient descent (SGD) for logistic regression
- SGD with a hashing kernel
- Extra credit: personalized logistic regression
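For the hashing-kernel task, the core idea is to map a huge token space into a fixed number of weight buckets so the weight vector has bounded size. The sketch below is illustrative only (the bucket count and the plain-modulus hash are assumptions, not the required implementation; a real hashing kernel would typically use a stronger hash function):

```java
public class HashingKernel {
    static final int NUM_BUCKETS = 1 << 20; // illustrative size, ~1M buckets

    // Map an arbitrary token id into a fixed-size weight array.
    // Math.floorMod keeps the index non-negative even for negative inputs.
    static int bucket(int tokenId) {
        return Math.floorMod(tokenId, NUM_BUCKETS);
    }

    public static void main(String[] args) {
        double[] weights = new double[NUM_BUCKETS];
        int token = 123456;
        // All tokens that hash to this bucket share one weight.
        weights[bucket(token)] += 0.1;
        System.out.println(weights[bucket(token)]); // 0.1
    }
}
```

The trade-off: colliding tokens share a weight, but memory is bounded regardless of how many distinct tokens appear.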
Starter Code

A class for parsing the input file and iterating over the dataset:

Dataset dataset = new Dataset(your_path, is_training, size);
while (dataset.hasNext()) {
    DataInstance d = dataset.next();
    // ... some action on d ...
}
Starter Code

public class DataInstance {
    int clicks;       // number of clicks, -1 if it is testing data
    int impressions;  // number of impressions, -1 if it is testing data

    // Features of the session
    int depth;        // depth of the session
    int[] query;      // list of token ids in the query field

    // Features of the ad
    ...

    // Features of the user
    ...
}
Starter Code

public class Weights {
    double w0;

    /*
     * query.get(123) will return the weight for the feature:
     * "token 123 in the query field".
     */
    Map<Integer, Double> query;
    Map<Integer, Double> title;
    Map<Integer, Double> keyword;
    Map<Integer, Double> description;

    double wPosition;
    double wDepth;
    double wAge;
    double wGender;
}
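As an example of how such per-token weight maps get used, scoring one instance only touches the tokens that appear in it. The helper below is illustrative, not part of the starter code:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseScore {
    // Partial dot product between the query weights and one instance's
    // query tokens. Only tokens present in the instance are touched:
    // O(#tokens), not O(d).
    static double queryScore(Map<Integer, Double> queryWeights, int[] queryTokens) {
        double score = 0.0;
        for (int token : queryTokens) {
            score += queryWeights.getOrDefault(token, 0.0);
        }
        return score;
    }

    public static void main(String[] args) {
        Map<Integer, Double> queryWeights = new HashMap<>();
        queryWeights.put(123, 0.7);
        queryWeights.put(456, -0.2);
        int[] tokens = {123, 999}; // 999 has no weight yet, so it contributes 0
        System.out.println(queryScore(queryWeights, tokens)); // 0.7
    }
}
```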
BigData is often sparse

Be as lazy as you can...
Update only when necessary...
Avoid O(d): sparse and lazy updates

Although the feature space d is huge, each data point contains only a few tokens, so update only the weights that actually change.
Even so, regularization nominally shrinks all d weights at every step. Instead of paying O(d) per step, delay and batch the regularization: apply a weight's accumulated shrinkage only when that weight is next touched.
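One common way to implement delayed regularization is to record, for each weight, the step at which it was last regularized, and apply all missed L2 shrinkage factors at once when the weight is next read. This is a minimal sketch under assumed hyperparameters (the class, field, and constant names are illustrative, not from the starter code):

```java
import java.util.HashMap;
import java.util.Map;

public class LazyRegularizer {
    // Illustrative hyperparameters (not from the starter code).
    static final double ETA = 0.01;     // learning rate
    static final double LAMBDA = 0.001; // L2 regularization strength

    Map<Integer, Double> weights = new HashMap<>();
    // For each weight, the step at which it was last regularized.
    Map<Integer, Integer> lastStep = new HashMap<>();
    int step = 0;

    // Catch up on all the L2 shrinkage this weight missed since it was
    // last touched: each skipped step would have multiplied it by
    // (1 - eta * lambda).
    double get(int featureId) {
        double w = weights.getOrDefault(featureId, 0.0);
        int last = lastStep.getOrDefault(featureId, step);
        w *= Math.pow(1.0 - ETA * LAMBDA, step - last);
        weights.put(featureId, w);
        lastStep.put(featureId, step);
        return w;
    }

    // Gradient update for one feature present in the current example.
    void update(int featureId, double gradient) {
        double w = get(featureId); // apply delayed regularization first
        weights.put(featureId, w - ETA * gradient);
    }
}
```

With this scheme, each SGD step costs O(number of tokens in the example) rather than O(d), because absent features are never touched until they reappear.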
Java Review

Not required but good to know: interfaces, inheritance, access modifiers, I/O, ...
Language: class, object, variable, method
Data structures: Java Collections
- Array
- List: ArrayList
- Map: HashMap
Class

public class DataInstance {
    // Features of the session (these are members, or fields)
    int[] query;
    // Features of the ad
    int[] title;

    // Constructor
    DataInstance(String line, ...) {
        // parse the line and set the fields
    }

    // Method
    public void print() {
        System.out.println("title: ");
        for (int token : title) System.out.print(token + "\t");
    }
}
Object

DataInstance data = new DataInstance();
int clicks = data.clicks;
data.print();
Collections

Array: int[] tokens; double[] weights
  - fixed length, most compact
ArrayList: ArrayList<DataInstance>
  - grows dynamically (expands its backing array when full); uses more memory
HashMap: HashMap<K, V>
  - constant-time key-value lookup; grows dynamically, uses more memory
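A quick side-by-side of the three containers (a standalone sketch, not part of the starter code):

```java
import java.util.ArrayList;
import java.util.HashMap;

public class CollectionsDemo {
    public static void main(String[] args) {
        // Array: fixed length, most compact representation.
        int[] tokens = {12, 7, 99};

        // ArrayList: grows automatically as elements are added.
        ArrayList<Integer> clicks = new ArrayList<>();
        clicks.add(1);
        clicks.add(0);

        // HashMap: constant-time lookup from key to value.
        HashMap<Integer, Double> weights = new HashMap<>();
        weights.put(12, 0.5);

        System.out.println(tokens.length);   // 3
        System.out.println(clicks.size());   // 2
        System.out.println(weights.get(12)); // 0.5
    }
}
```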
Variables

"Everything" in Java is an object, except for the primitive types: int, double, ...
All object variables are references (pointers) to the object.
Functions pass variables by value: primitives are copied, and for objects the reference itself is copied.
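The distinction shows up in a short example (illustrative, not from the starter code): mutating an object through a copied reference is visible to the caller, while reassigning a copied primitive or reference is not.

```java
public class PassByValue {
    static void modify(int[] arr, int x) {
        arr[0] = 42; // mutates the shared array object: visible to the caller
        x = 42;      // reassigns the local copy only: invisible to the caller
    }

    public static void main(String[] args) {
        int[] arr = {1};
        int x = 1;
        modify(arr, x);
        System.out.println(arr[0]); // 42 -- the copied reference points to
                                    //       the same array object
        System.out.println(x);      // 1  -- the primitive was copied by value
    }
}
```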
Example: SGD for linear regression
Demo
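A minimal version of such a demo might look like the sketch below: SGD on squared loss for a model y = w*x + b. The toy data, learning rate, and epoch count are made up for illustration and are not the recitation's actual demo code.

```java
public class SgdLinearRegression {
    // Fit y = w*x + b by SGD on squared loss; returns {w, b}.
    static double[] fit(double[] xs, double[] ys, double eta, int epochs) {
        double w = 0.0, b = 0.0;
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < xs.length; i++) {
                double err = (w * xs[i] + b) - ys[i]; // prediction error
                w -= eta * err * xs[i]; // gradient of 0.5*err^2 w.r.t. w
                b -= eta * err;         // gradient of 0.5*err^2 w.r.t. b
            }
        }
        return new double[]{w, b};
    }

    public static void main(String[] args) {
        // Toy data generated from y = 2x + 1 (no noise).
        double[] xs = {0, 1, 2, 3, 4};
        double[] ys = {1, 3, 5, 7, 9};
        double[] wb = fit(xs, ys, 0.05, 1000);
        System.out.printf("w = %.3f, b = %.3f%n", wb[0], wb[1]); // near 2 and 1
    }
}
```

Because the toy data is exactly linear, the per-example error shrinks toward zero and the parameters converge to the true values.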