Data Challenge 3

As part of the Data Incubation application, we're asked to propose a data science project. As a graduate student in medical research, I'm always on the lookout for ways to improve medical screening and treatment options. In medicine, prevention is important but not always possible, and early diagnosis almost always results in a more favorable prognosis than late diagnosis. In a perfect world, unbiased systems would screen everyone continuously for signs of sickness; however, such systems have not been developed yet. In the meantime, social media offers a way to monitor the health status of people all over the world who post openly about how they feel in the moment. This data can support many kinds of health-related modeling: what times people tend to start feeling sick, what types of ailments are common, and even how common remedies or medical interventions are affecting their disease state. However, I envision this data being even more valuable at the public health level: tweets can be used to track the spread of symptoms during an outbreak, enabling medical responders to find patients who may not recognize their need for medical attention. Clusters of symptoms reported around the same time in the same region can be used to track the spread of infectious diseases, helping quarantine and control efforts. Current methods of disease control rely on individuals self-reporting symptoms and on tracing the people they may have been in contact with; social media status updates would instead allow geographic maps of disease spread to be generated in real time.
To illustrate the potential of this data, I ran a preliminary analysis on a list of 1,600,000 tweets, looking for those related to mental or physical health states. The analysis used the word "feel" as a keyword, since people typically use it to describe a mental or physical state. This yielded 4,451 tweets. Since many people use social media to direct comments and statuses at specific individuals rather than the general public, I then looked at how many of these tweets were directed at another Twitter user. Of the 4,451 tweets, 927 (about 21%) were directed at another user. Future analyses can use these users as "branch points": gathering their tweets as well would add information about their health status and help track disease spread efficiently and in real time, without waiting for people to self-report to a clinic.
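The keyword filter and the directed-tweet check can be sketched roughly as follows. This is a minimal illustration, not the original project code: the class and method names are hypothetical, matching "feel" at the start of a word (so "feels" and "feeling" also count) is an assumption, and so is treating a tweet as "directed" only when it opens with an @mention. The sample tweets are made up.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FeelTweetFilter {

    // Assumption: match "feel" at the start of a word, case-insensitively,
    // so "feel", "feels" and "Feeling" all count.
    private static final Pattern FEEL =
            Pattern.compile("\\bfeel", Pattern.CASE_INSENSITIVE);

    // Assumption: a tweet is "directed" when it opens with an @mention.
    // Twitter usernames are letters, digits and underscores, up to 15 characters.
    private static final Pattern LEADING_MENTION =
            Pattern.compile("^\\s*@(\\w{1,15})\\b");

    /** True if the tweet text contains the keyword. */
    static boolean mentionsFeeling(String text) {
        return FEEL.matcher(text).find();
    }

    /** The username the tweet is addressed to, if it starts with an @mention. */
    static Optional<String> directedAt(String text) {
        Matcher m = LEADING_MENTION.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample tweets, not taken from the dataset.
        List<String> tweets = List.of(
                "I feel a headache coming on",
                "@sam_k feeling so nauseous today",
                "great game last night",
                "@alex see you at 5");

        long feeling = tweets.stream()
                .filter(FeelTweetFilter::mentionsFeeling)
                .count();
        long directed = tweets.stream()
                .filter(FeelTweetFilter::mentionsFeeling)
                .filter(t -> directedAt(t).isPresent())
                .count();

        System.out.println(feeling + " feeling tweets, " + directed + " directed at a user");
    }
}
```

The usernames returned by `directedAt` are what a follow-up analysis could use as the "branch points" mentioned above.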

Additionally, I looked at the times at which people tend to tweet about their feelings, as this indicates when individuals tend to notice symptoms of either a change in mental state (e.g. feeling sad or upset) or a change in health (e.g. the onset of a headache or nausea). Tweets were divided into four groups (morning, midday, afternoon, night), each covering a 6-hour range, starting with 5:00-10:59 A.M. as the morning group. A large proportion of tweets about changes in health status seem to occur in the morning or nighttime hours, which is when circadian rhythms may render biological systems more sensitive to the manifestation of symptoms. Future analyses with larger datasets can help generate insights about the relationship between circadian rhythms, known as the body's "internal biological clock", and mental and physical symptoms.
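As a rough sketch of the time grouping: only the morning range (5:00-10:59 A.M.) is stated above, so the other three windows here are assumed to be the consecutive 6-hour blocks that follow it (midday 11:00-16:59, afternoon 17:00-22:59, night 23:00-4:59). The class and method names are hypothetical, and the sample times are made up.

```java
import java.time.LocalTime;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class TimeOfDayBuckets {

    // Order matters: the buckets are consecutive 6-hour windows starting at 05:00.
    enum Bucket { MORNING, MIDDAY, AFTERNOON, NIGHT }

    /** Maps a clock time to its 6-hour bucket. */
    static Bucket bucketFor(LocalTime time) {
        // Shift the clock so 05:00 becomes hour 0, wrapping 00:00-04:59 around to 19-23.
        int shiftedHour = Math.floorMod(time.getHour() - 5, 24);
        return Bucket.values()[shiftedHour / 6];
    }

    public static void main(String[] args) {
        // Made-up tweet times; real timestamps would need parsing from the dataset's own format.
        List<String> times = List.of("05:00", "10:59", "11:00", "23:30", "04:59");

        Map<Bucket, Integer> counts = new EnumMap<>(Bucket.class);
        for (String t : times) {
            counts.merge(bucketFor(LocalTime.parse(t)), 1, Integer::sum);
        }
        System.out.println(counts);
    }
}
```

Shifting the hour before dividing keeps the night bucket, which wraps past midnight, from needing a special case.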

My hope is that this data will enable proactive, rather than reactive, interventions against disease spread, and further research into how timing affects symptom manifestation and treatment.

For this project, I used Java (in the Eclipse SDK) to import the .csv file of tweets downloaded from Stanford SNAP here.
