
Commit 8f2fb87

Updated Part 3 Demos & Exercises
1 parent a9e4f71 commit 8f2fb87

21 files changed, +1434 -452 lines changed

3_advanced/.ipynb_checkpoints/1_Stream-Stream Join Demo-checkpoint.ipynb

Lines changed: 75 additions & 27 deletions
@@ -7,18 +7,6 @@
 "# Stream-Stream Join Demo"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Different types of Join\n",
-"Stream-stream joins\n",
-"Stream-dataset joins\n",
-"DEMO: Do a demo with Stream-stream joins\n",
-"DEMO: Do a demo with Stream-dataset joins\n",
-"EXERCISE: Give an exercise with Stream-stream joins or Stream-dataset joins\n"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -40,20 +28,7 @@
 "windowedStream1 = stream1.window(20)\n",
 "windowedStream2 = stream2.window(60)\n",
 "joinedStream = windowedStream1.join(windowedStream2)\n",
-"```\n",
-"\n",
-"### Stream-dataset joins\n",
-"\n",
-"This has already been shown earlier while explain `DStream.transform` operation. Here is yet another example of joining a windowed stream with a dataset.\n",
-"```python\n",
-"dataset = ... # some RDD\n",
-"windowedStream = stream.window(20)\n",
-"joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))\n",
-"```\n",
-"In fact, you can also dynamically change the `dataset` you want to join against. The function provided to `transform` is evaluated every batch interval and therefore will use the current dataset that `dataset` reference points to.\n",
-"\n",
-"The complete list of DStream transformations is available in the API documentation. For the Python API, see [DStream](https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream).\n",
-"\n"
+"```"
 ]
 },
 {
@@ -81,7 +56,80 @@
 "collapsed": true
 },
 "outputs": [],
-"source": []
+"source": [
+"from pyspark import SparkContext\n",
+"from pyspark.streaming import StreamingContext\n",
+"from pprint import pprint\n",
+"import time"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"sc = SparkContext()\n",
+"ssc = StreamingContext(sc, 1)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"rdd_queue = []\n",
+"for i in xrange(5): \n",
+"    rdd_data = xrange(1000)\n",
+"    rdd = ssc.sparkContext.parallelize(rdd_data)\n",
+"    rdd_queue.append(rdd)\n",
+"pprint(rdd_queue)\n",
+"\n",
+"# Creating queue stream # 1\n",
+"ds1 = ssc.queueStream(rdd_queue).map(lambda x: (x % 10, 1)).window(4).reduceByKey(lambda v1,v2:v1+v2)\n",
+"ds1.pprint()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# Creating queue stream # 2\n",
+"ds2 = ssc.queueStream(rdd_queue).map(lambda x: (x % 5, 1)).window(windowDuration=20).reduceByKey(lambda v1,v2:v1+v2)\n",
+"ds2.pprint()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# Crossing the Streams\n",
+"joinedStream = ds1.join(ds2)\n",
+"joinedStream.pprint()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"ssc.start()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"ssc.stop()"
+]
 },
 {
 "cell_type": "markdown",

3_advanced/.ipynb_checkpoints/2_Stream-Dataset Join Demo-checkpoint.ipynb

Lines changed: 87 additions & 32 deletions
@@ -11,37 +11,6 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Different types of Join\n",
-"Stream-stream joins\n",
-"Stream-dataset joins\n",
-"DEMO: Do a demo with Stream-stream joins\n",
-"DEMO: Do a demo with Stream-dataset joins\n",
-"EXERCISE: Give an exercise with Stream-stream joins or Stream-dataset joins\n"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### Join Operations\n",
-"\n",
-"Finally, its worth highlighting how easily you can perform different kinds of joins in Spark Streaming.\n",
-"\n",
-"### Stream-stream joins\n",
-"\n",
-"Streams can be very easily joined with other streams.\n",
-"```python\n",
-"stream1 = ...\n",
-"stream2 = ...\n",
-"joinedStream = stream1.join(stream2)\n",
-"```\n",
-"Here, in each batch interval, the RDD generated by `stream1` will be joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, `rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do joins over windows of the streams. That is pretty easy as well.\n",
-"```python\n",
-"windowedStream1 = stream1.window(20)\n",
-"windowedStream2 = stream2.window(60)\n",
-"joinedStream = windowedStream1.join(windowedStream2)\n",
-"```\n",
-"\n",
 "### Stream-dataset joins\n",
 "\n",
 "This has already been shown earlier while explaining the `DStream.transform` operation. Here is yet another example of joining a windowed stream with a dataset.\n",
@@ -74,14 +43,100 @@
 "findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from pyspark import SparkContext\n",
+"from pyspark.streaming import StreamingContext\n",
+"import time"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"sc = SparkContext(\"local[2]\", \"IP-Matcher\")\n",
+"ssc = StreamingContext(sc, 2)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"ips_rdd = sc.parallelize(set())\n",
+"lines_ds = ssc.socketTextStream(\"localhost\", 9999)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# split each line into IPs\n",
+"ips_ds = lines_ds.flatMap(lambda line: line.split(\" \"))\n",
+"pairs_ds = ips_ds.map(lambda ip: (ip, 1))"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# join with the IPs RDD\n",
+"matches_ds = pairs_ds.transform(lambda rdd: rdd.join(ips_rdd))\n",
+"matches_ds.pprint()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# In another window run:\n",
+"# $ nc -lk 9999\n",
+"# Then enter IP addresses separated by spaces into the nc window"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
 "collapsed": true
 },
 "outputs": [],
-"source": []
+"source": [
+"ssc.start()\n",
+"\n",
+"# alternate between two sets of IP addresses for the RDD\n",
+"IP_FILES = ('data/ip_file1.txt', 'data/ip_file2.txt')\n",
+"file_index = 0\n",
+"while True:\n",
+"    with open(IP_FILES[file_index]) as f:\n",
+"        ips = f.read().splitlines()\n",
+"    ips_rdd = sc.parallelize(ips).map(lambda ip: (ip, 1))\n",
+"    print \"using\", IP_FILES[file_index]\n",
+"    file_index = (file_index + 1) % len(IP_FILES)\n",
+"    time.sleep(8)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"ssc.stop()"
+]
 },
 {
 "cell_type": "markdown",

3_advanced/.ipynb_checkpoints/3_Join Operations Exercise-checkpoint.ipynb

Lines changed: 77 additions & 14 deletions
@@ -7,18 +7,6 @@
 "# Join Operations Exercise"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Different types of Join\n",
-"Stream-stream joins\n",
-"Stream-dataset joins\n",
-"DEMO: Do a demo with Stream-stream joins\n",
-"DEMO: Do a demo with Stream-dataset joins\n",
-"EXERCISE: Give an exercise with Stream-stream joins or Stream-dataset joins\n"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -60,7 +48,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Exercise"
+"### Exercise\n",
+"Create a streaming app that joins incoming orders with our existing knowledge of whether each customer is good or bad."
 ]
 },
 {
@@ -74,14 +63,88 @@
 "findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from pyspark import SparkContext\n",
+"from pyspark.streaming import StreamingContext\n",
+"import time"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"sc = SparkContext()\n",
+"ssc = StreamingContext(sc, 1)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# For testing, create prepopulated QueueStream of streaming customer orders. \n",
+"transaction_rdd_queue = []\n",
+"for i in xrange(5): \n",
+"    transactions = [(customer_id, None) for customer_id in xrange(10)]\n",
+"    transaction_rdd = ssc.sparkContext.parallelize(transactions)\n",
+"    transaction_rdd_queue.append(transaction_rdd)\n",
+"print(transaction_rdd_queue)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# Batch RDD of whether customers are good or bad. \n",
+"# (customer_id, is_good_customer)\n",
+"customers = [(0,True),(1,False), (2,True), (3,False), (4,True), (5,False), (6,True), (7,False), (8,True), (9,False)]\n",
+"customer_rdd = ssc.sparkContext.parallelize(customers)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# Creating queue stream\n",
+"ds = ssc.queueStream(transaction_rdd_queue)"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
 "collapsed": true
 },
 "outputs": [],
-"source": []
+"source": [
+"# Join the streaming RDD and batch RDDs to filter out bad customers.\n",
+"dst = ds.transform(lambda rdd: rdd.join(customer_rdd)).filter(lambda (customer_id, (customer_data, is_good_customer)): is_good_customer)\n",
+"## END OF EXERCISE SECTION ==================================\n",
+"dst.pprint()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"ssc.start()\n",
+"time.sleep(6)\n",
+"ssc.stop()"
+]
 },
 {
 "cell_type": "markdown",
