
James Cook University Subject Handbook - 2022


MA3831 - Natural Language Processing, Web Scraping and Large Data Processing

Credit points: 3
Year: 2022
Student Contribution Band: Band 1
Prerequisites: CP1404 AND MA3405
Administered by: College of Science and Engineering (pre 2023 PCS)

Subject Description

    This subject will provide students with cutting-edge tools and techniques for data science. There are two parts to this subject. In the first half of the subject, students will explore natural language processing (NLP), web scraping and APIs to harvest data with Python, and explore the data science workbench approach to managing production pipelines of work that can be re-used in different data science projects. In the second half of the subject, students will focus on computer models and software designed to handle Big Data sets in a distributed and/or parallel fashion. Particular focus will be given to distributed and parallel computing using Map-Reduce/Hadoop and similar models for processing Big Data sets.
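
As an illustration of the Map-Reduce model named in the description, the following is a minimal single-process sketch in Python. It is not taken from the subject materials. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, and a real Hadoop job would distribute these phases across many machines rather than run them in one process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data big models", "data pipelines"])
print(counts)
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, both steps can in principle run in parallel, which is the property the distributed model relies on.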

Learning Outcomes

  • understand and apply new data science skills, knowledge and techniques to solve problems in data science using NLP
  • apply data science skills, knowledge and techniques to solve problems in data science NLP projects with a focus on web scraping
  • understand how to deploy data science projects into production pipelines
  • compare and evaluate different systems and approaches for high-performance and large-scale computing for analytics for standard data and big data
  • manage and prepare data using standard management frameworks, transforming and cleaning it to ensure classical characteristic outcomes are achieved
  • examine and deploy data processing tasks in the Hadoop ecosystem for big data

Subject Assessment

  • Written > Case report 1 - (20%) - Individual
  • Written > Project report - (50%) - Individual
  • Written > Technical report - (30%) - Individual

Note that minor variations might occur due to the continuous subject quality improvement process, and in case of minor variation(s) in assessment details, the Subject Outline represents the latest official information.

Availabilities

Cairns, Study Period 1, Internal

Census date: Thursday, 24 Mar 2022
Study Period Dates: Monday, 21 Feb 2022 to Friday, 17 Jun 2022
Coordinator(s):
Dr David Donald
Lecturer(s):
Dr David Donald
Dr Carla Ewels
Dr Nathan White
Workload expectations: The student workload for this 3 credit point subject is approximately 130 hours.
  • 52 Hours - Workshops

JCU Singapore, Study Period 51, Internal

Not expected to be available after this year.

Census date: Thursday, 07 Apr 2022
Study Period Dates: Monday, 14 Mar 2022 to Friday, 17 Jun 2022
Lecturer(s):
Dr Eric Tham
Workload expectations: The student workload for this 3 credit point subject is approximately 130 hours.
  • 26 Hours - Pre-recorded content/lectures
  • 26 Hours - Online workshops

JCU Singapore, Study Period 53, Internal

Not expected to be available after this year.

Census date: Thursday, 01 Dec 2022
Study Period Dates: Monday, 07 Nov 2022 to Friday, 17 Feb 2023
Coordinator(s):
Dr David Donald
Lecturer(s):
Dr Eric Tham
Workload expectations: The student workload for this 3 credit point subject is approximately 130 hours.
  • 26 Hours - Pre-recorded content/lectures
  • 26 Hours - Online workshops

Townsville, Study Period 1, Internal

Census date: Thursday, 24 Mar 2022
Study Period Dates: Monday, 21 Feb 2022 to Friday, 17 Jun 2022
Coordinator(s):
Dr David Donald
Lecturer(s):
Dr David Donald
Dr Carla Ewels
Dr Nathan White
Workload expectations: The student workload for this 3 credit point subject is approximately 130 hours.
  • 26 Hours - Pre-recorded content/lectures
  • 26 Hours - Online workshops