
James Cook University Subject Handbook - 2023

For subject information from 2025 and onwards, please visit the new JCU Course and Subject Handbook website.

MA3831 - Natural Language Processing, Web Scraping and Large Data Processing

Credit points: 03
Year: 2023
Student Contribution Band: Band 1
Prerequisites: CP1404 AND MA3405
Administered by: College of Science and Engineering

Subject Description

    This subject provides students with cutting-edge tools and techniques for data science. It has two parts. In the first half, students explore natural language processing (NLP), web scraping and APIs to harvest data with Python, and examine the data science workbench approach to managing production pipelines that can be reused across data science projects. In the second half, students focus on computer models and software designed to handle big data sets in a distributed and/or parallel fashion, with particular attention to distributed and parallel computing using MapReduce/Hadoop and similar models.

Learning Outcomes

  • understand and apply new data science skills, knowledge and techniques to solve problems in data science using NLP
  • apply data science skills, knowledge and techniques to solve problems in data science NLP projects with a focus on web scraping
  • understand how to deploy data science projects into production pipelines
  • compare and evaluate different systems and approaches to high-performance and large-scale computing for analytics on standard data and big data
  • manage and prepare data using standard management frameworks for the purpose of transforming and cleaning it to ensure classical characteristic outcomes are achieved
  • examine and deploy data processing tasks in the Hadoop ecosystem for big data

Subject Assessment

  • Written > Case report 1 - (20%) - Individual
  • Written > Project report - (50%) - Individual
  • Written > Technical report - (30%) - Individual

Note that minor variations might occur due to the continuous subject quality improvement process, and in case of minor variation(s) in assessment details, the Subject Outline represents the latest official information.

Availabilities

Cairns Nguma-bada, Study Period 1, Internal

Census date: Thursday, 23 Mar 2023
Study Period Dates: Monday, 20 Feb 2023 to Friday, 16 Jun 2023
Coordinator(s):
Dr David Donald
Lecturer(s):
Dr David Donald
Dr Carla Ewels
Dr Nathan White
Workload expectations: The student workload for this 3 credit point subject is approximately 130 hours.
  • 52 Hours - Workshops

Cairns Nguma-bada, Study Period 86, Internal

Census date: Thursday, 09 Nov 2023
Study Period Dates: Monday, 30 Oct 2023 to Friday, 15 Dec 2023
Coordinator(s):
Dr David Donald
Lecturer(s):
Dr David Donald
Workload expectations: The student workload for this 3 credit point subject is approximately 130 hours.
  • 52 Hours - Workshops

JCU Singapore, Study Period 52, Internal

Census date: Thursday, 03 Aug 2023
Study Period Dates: Monday, 10 Jul 2023 to Friday, 13 Oct 2023
Coordinator(s):
Dr David Donald
Lecturer(s):
Dr Eric Tham
Workload expectations: The student workload for this 3 credit point subject is approximately 130 hours.
  • 26 Hours - Online activity
  • 26 Hours - Online Workshops

Townsville Bebegu Yumba, Study Period 1, Internal

Census date: Thursday, 23 Mar 2023
Study Period Dates: Monday, 20 Feb 2023 to Friday, 16 Jun 2023
Coordinator(s):
Dr David Donald
Lecturer(s):
Dr David Donald
Dr Carla Ewels
Dr Nathan White
Workload expectations: The student workload for this 3 credit point subject is approximately 130 hours.
  • 26 Hours - Online activity
  • 26 Hours - Online Workshops