
Web-browsing patterns reflect and shape mood and mental health

Ethical approval was provided by the Research Ethics Committee at University College London and all participants gave their informed consent to participate. Each subject participated in only one of the studies reported.

Study 1

Participants

Three hundred and twelve participants were recruited through the Prolific platform. Participants were recruited from the UK and US and were 18 years or older. Sample sizes were determined based on a pilot study to achieve a power of 0.95 (α = 0.05), using G*Power89. Data from 25 participants whose browsing did not result in at least 1 KB of text from at least three webpages each day were not analysed further. Thus, data for 287 participants were analysed (age 33.17 years, s.d. 11.71; 50.5% females, 48.1% males, 1.4% other). Of those, 171 participants also completed state mood ratings. Data from five participants who indicated that they had submitted archived browsing history were excluded from the mood analysis, because their current mood ratings could not be temporally associated with their submitted browsing data, leaving n = 164 for the mood analysis (age 33.23 years, s.d. 11.62; 52.4% females, 47.6% males, 0% other). All participants received £7.50 for their participation on day 1 and £3.25 on each of days 2–5.

Procedure

Data collection

Participants were asked to browse the internet for 20 minutes a day for 5 days using Mozilla Firefox and then to submit their internet search history for this period (see Supplementary Materials for study instructions). We used Mozilla Firefox because it was, to our knowledge, the only browser that allowed users to extract the exact URLs they visited with relative ease from their browsing history. We extracted the paragraph text from each webpage, denoted by <p> in the webpage’s HTML code, using the ‘rvest’ package in RStudio. We then cleaned the text by removing extraneous information such as punctuation, symbols (for example, @, #), emojis, links (URLs) and all other non-alphanumeric characters (similar to Kelley and Gillan22). Participants were asked to browse the internet during non-work hours so that their web-browsing behaviour would not reflect mandatory work-related tasks. All consecutive duplicate webpages were removed from the analysis. Being mindful that content can change quickly, we made sure to extract the vast majority of the text from webpages within 24 hours (and at the very most 36 hours) of the time the participant visited the page. However, it is still possible that some content changed during this time. Therefore, we ran a validation to check the stability of webpages’ valence scores (see Supplementary Analysis 3 for details). This validation confirmed that the valence of webpages remained highly stable across consecutive days (negative score: r(998) = 0.991, 95% CI 0.990, 0.992, P < 0.001; positive score: r(998) = 0.991, 95% CI 0.989, 0.992, P < 0.001).
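The extraction and cleaning steps above were performed with the ‘rvest’ package in R; the snippet below is a rough Python sketch of the same pipeline (pull <p> text, then strip links and non-alphanumeric characters), shown for illustration only and not the study’s actual code.

```python
import re

def extract_clean_paragraphs(html: str) -> str:
    """Extract <p> text from raw HTML and strip non-alphanumeric noise.

    Illustrative Python sketch; the study performed these steps with
    the 'rvest' package in R.
    """
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html, flags=re.S | re.I)
    text = " ".join(paragraphs)
    text = re.sub(r"<[^>]+>", " ", text)          # drop any nested tags
    text = re.sub(r"https?://\S+", " ", text)     # drop links (URLs)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # punctuation, symbols, emojis
    return re.sub(r"\s+", " ", text).strip()

page = "<div><p>Great news! Visit https://example.com :)</p><p>More text.</p></div>"
print(extract_clean_paragraphs(page))  # → Great news Visit More text
```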

Selecting the method to measure text valence

There are many validated methods to score text on sentiment (valence). These include machine-learning methods90,91,92 and ‘bag of words’ (lexicon) approaches that are developed by asking large groups of people to rate words on specific dimensions64,65. We first tested whether these different methods provide consistent scores for participants. We selected two popular lexicons—the NRC VAD lexicon64 and the Hu and Liu Opinion lexicon65—and a state-of-the-art large language machine-learning model, the distilbert-base-uncased-finetuned-sst-2-english (DistilBERT63), which is fine-tuned for sentiment analysis tasks. For each webpage, the DistilBERT model provides probabilities representing the likelihood of the text expressing positive or negative sentiment. These probabilities were the positive and negative sentiment scores used for the large language model. For the NRC VAD lexicon64, the valence of each word is categorized on a scale from 0 (most negative) to 1 (most positive). In line with Kiritchenko and colleagues93, we computed the percentage of words with a positive valence score ≥0.75 (2,668 terms; for example, ‘delicious’ and ‘admire’) and percentage of words with a negative valence score ≤0.25 (3,081 terms; for example, ‘despise’ and ‘danger’), of all words contained in the extracted text of each webpage visited for each of the 5 days. For the Hu and Liu method65, the lexicon contains separate positive and negative word lists, with no weighting applied as in the NRC method. We calculated the percentage of positive and negative words from all words contained in the extracted text of each webpage visited over the same 5-day period.
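As a concrete illustration of the lexicon scoring just described, the sketch below computes the percentage of positive (valence ≥0.75) and negative (valence ≤0.25) words in a text against a toy lexicon. The entries and scores here are placeholders for illustration, not the actual NRC VAD values.

```python
# Toy valence lexicon with scores in [0, 1]; illustrative values only,
# not the published NRC VAD scores.
VALENCE = {"admire": 0.90, "delicious": 0.94, "danger": 0.12, "despise": 0.07}

def valence_scores(text):
    """Return (% positive, % negative) words, using the >=0.75 / <=0.25
    thresholds applied in the analysis. Unlisted words count as neutral."""
    words = text.lower().split()
    pos = sum(1 for w in words if VALENCE.get(w, 0.5) >= 0.75)
    neg = sum(1 for w in words if VALENCE.get(w, 0.5) <= 0.25)
    return 100 * pos / len(words), 100 * neg / len(words)

pos_pct, neg_pct = valence_scores("I admire the delicious food despite the danger here")
```

With 2 positive and 1 negative word out of 9, this returns roughly 22.2% positive and 11.1% negative, mirroring the per-webpage percentages used in the analysis.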

We used each method separately to score all webpages visited by the first 100 participants from study 1 and averaged the webpage scores for each participant. We used an ICC analysis to examine how consistent the scores were across different scoring methods, separately for positive and negative scores. All scores were Z-scored before analysis to ensure they were on the same scale. We observed good reliability between all three methods: (1) the NRC VAD lexicon and the Hu and Liu Opinion lexicon (positive score: ICC = 0.835, P < 0.001; negative score: ICC = 0.948, P < 0.001); (2) the NRC VAD lexicon and the DistilBERT algorithm (positive score: ICC = 0.812, P < 0.001; negative score: ICC = 0.869, P < 0.001); and (3) the DistilBERT algorithm and the Hu and Liu Opinion lexicon (positive score: ICC = 0.866, P < 0.001; negative score: ICC = 0.885, P < 0.001). This suggests that these different methods measure the same construct.
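The pairwise consistency checks above can be reproduced with a two-rater consistency ICC (ICC(3,1)). The pure-Python sketch below is one way to compute that statistic; the paper does not specify the software used for its ICCs, so treat this as an illustration of the measure rather than the exact implementation.

```python
from statistics import mean

def icc_consistency(x, y):
    """Two-rater consistency ICC (ICC(3,1)): agreement between two
    scoring methods, up to an additive shift between them."""
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = mean(x + y)
    row_means = [mean(r) for r in rows]
    col_means = [mean(x), mean(y)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
```

ICC(3,1) treats the two scoring methods as fixed ‘raters’ and rewards agreement up to a constant offset, which is appropriate here because all scores were Z-scored before analysis.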

We further checked that the above scores reflected human assessment. To that end, we asked a fresh set of participants (n = 100) to rate the positive (0 (not at all) to 6 (very positive)) and negative (0 (not at all) to 6 (very negative)) valence of 10 randomly assigned webpages from a corpus of 48 webpages. We then computed the positive and negative valence scores for each webpage using the methods above, paired them with the respective human ratings for that webpage, and submitted the positive and negative pairs into an ICC to calculate their reliability. All scores were Z-scored to standardize them for comparison. The human ratings were significantly related to the NRC valence scores (negative score: ICC = 0.707, 95% CI 0.668, 0.742, P < 0.001; positive score: ICC = 0.499, 95% CI 0.432, 0.558, P < 0.001), the Hu and Liu valence scores (negative score: ICC = 0.680, 95% CI 0.668, 0.742, P < 0.001; positive score: ICC = 0.510, 95% CI 0.432, 0.558, P < 0.001) and the DistilBERT algorithm scores (negative score: ICC = 0.472, 95% CI 0.402, 0.534, P < 0.001; positive score: ICC = 0.384, 95% CI 0.302, 0.465, P < 0.001). Because the NRC lexicon reflected human subjective assessments of webpages slightly better on average than both the Hu and Liu method and the machine-learning method, and because it also has a specific emotion lexicon while requiring fewer computational resources than the machine-learning methods, we chose it for our analysis.

Given that the method we used scores entire webpages rather than the text participants actually consume, it was important to test whether the former was a good indicator of the latter. To that end, we adopted two approaches. First, we examined whether there is good reliability between the valence of text on a whole webpage and the valence of text on a random part of it. To test this, we randomly extracted segments of text from webpages (n = 100) with a minimum word count of 200 words94. We then calculated the positive and negative scores for each randomly sampled segment and for the corresponding whole text and submitted those into an ICC analysis to calculate reliability (separately for the positive and negative scores). We observed good reliability between the NRC valence scores of randomly sampled segments and the scores of their respective webpages’ whole texts (negative score: ICC = 0.945, 95% CI 0.918, 0.963, P < 0.001; positive score: ICC = 0.947, 95% CI 0.922, 0.965, P < 0.001). This result suggests that by analysing the whole text of a webpage, we can reliably compute the sentiment of a random section of that webpage.

Second, we examined directly whether there is good reliability between the valence scores of the text of a whole webpage and the valence of the text that participants attended to the most. To test this, a new group of participants were asked to browse the internet for 10 minutes while their eye movements were tracked via a web camera (Fig. 2a). This test involved 19 participants who collectively visited 59 different websites. Participants were included in the study if they visited at least one webpage that contained paragraph text. We calculated the NRC valence scores for both the text areas that captured most of the participants’ attention (highlighted in red on the heatmap generated by our algorithm in Fig. 2a) and the entirety of the text on each webpage. Our analysis included two separate mixed-effect models (using the ‘lme4’ package in RStudio). The first model predicted the positive valence of the text on the entire webpage from the positive valence of the text in areas that received the most attention. The second model predicted the negative valence of the text on the entire webpage from the negative valence of the text in areas that received the most attention. Both models included fixed effects as well as random effects and intercepts. The results showed that both the positive valence (β = 0.304 ± 0.051 s.e., t(10.47) = 5.876, P < 0.001) and negative valence (β = 0.406 ± 0.112 s.e., t(7.89) = 3.617, P = 0.007) of the attended-to text areas were strongly associated with the valence of the entire webpage text (Fig. 2b,c). Therefore, the overall valence scores of a webpage’s text can reliably reflect the valence of the sections that attract the most attention from users.

Next, we conducted two studies to test whether key parameters differ based on the device used (Supplementary Analysis 2). First, we asked one group of participants (n = 25) to view webpages on their smartphones and another group (n = 25) to view the same webpages on their desktop or laptop computers. All participants provided two sentiment ratings (positive and negative) for each page on a six-point Likert scale from ‘not at all’ to ‘very much’, and we compared scores across devices. Second, we asked a new group of participants (n = 28) to browse the internet for 15 minutes on their mobile phone on one day and on their desktop or laptop on another day. We then calculated the valence scores of the webpages they selected to browse on each day and compared them across devices. The findings from these two studies demonstrate that the valence and affective impact of webpages do not differ based on the device used for browsing, whether it be a mobile phone or a desktop or laptop (Supplementary Analysis 2).

Assessment of mental health and mood

On day 1, before the web-browsing task, participants completed self-report questionnaires that assess psychopathology symptoms (the list is adapted from Gillan and colleagues66). These were: Obsessive-Compulsive Inventory—Revised95, Self-Rating Depression Scale96, State–Trait Anxiety Inventory97, Alcohol Use Disorder Identification Test98, Apathy Evaluation Scale99, Eating Attitudes Test100, Barratt Impulsivity Scale101, Short Scales for Measuring Schizotypy102 and Liebowitz Social Anxiety Scale103. On days 1–5, participants indicated their current mood directly before their web-browsing session and directly afterwards, on a scale from ‘very unhappy’ to ‘very happy’. Happiness is considered a key component of overall well-being104 and mental health62. Indeed, the American Psychological Association (APA) definition of mental health states that emotional well-being, which includes happiness, is an integral part of mental health62. The use of a continuous scale for assessing levels of happiness is a well-established method in psychological research105,106,107. This allowed us to test whether participants’ pre-browsing mood and post-browsing mood were related to the valence of information they browsed. The task was coded using the Qualtrics online platform (www.qualtrics.com).

Analysis

Assessing the stability of the valence of web-browsing across time

To assess the within-subject stability of the valence of webpages visited across the 5 days, we calculated an ICC. Specifically, we submitted separately the negative and positive valence score and the scores for the specific emotions of webpages visited by each participant for each of the 5 days into ICC analysis.

Relating the valence of webpages to mental health

Each participant was scored on the three psychopathology dimensions identified by Gillan and colleagues66 and replicated by Rouault and colleagues67 (‘anxious-depression’, ‘social-withdrawal’ and ‘compulsive-behaviour and intrusive thought’). To generate these scores, we followed Kelly and Sharot10 and Seow and Gillan68—we first Z-scored the ratings for each questionnaire item separately across participants. Next, we multiplied each Z-scored item by its factor weight as identified earlier66. Then for each subject the three psychopathology dimension scores were calculated by summing all the weighted items assigned to each dimension.
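The scoring procedure just described (Z-score each item across participants, weight by its factor loading, then sum within each dimension) can be sketched as follows. The items and weights below are placeholders for illustration, not the published factor loadings from Gillan and colleagues66.

```python
from statistics import mean, pstdev

def dimension_scores(item_ratings, weights):
    """Weighted psychopathology dimension scores.

    item_ratings: {item: [rating per participant]}
    weights: {dimension: {item: factor weight}} -- placeholder weights here,
    not the published factor loadings.
    """
    # Z-score each questionnaire item separately across participants
    z = {}
    for item, vals in item_ratings.items():
        m, s = mean(vals), pstdev(vals)
        z[item] = [(v - m) / s for v in vals]
    n = len(next(iter(item_ratings.values())))
    # For each dimension, sum the weighted z-scored items per participant
    return {dim: [sum(w * z[item][p] for item, w in wmap.items())
                  for p in range(n)]
            for dim, wmap in weights.items()}

ratings = {"item1": [1, 3], "item2": [2, 4]}                      # 2 participants
weights = {"anxious_depression": {"item1": 0.5, "item2": 0.25}}   # placeholder
scores = dimension_scores(ratings, weights)
```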

For each participant, we calculated the positive valence and negative valence scores separately across all webpages visited on each day and then averaged the daily scores across the 5 days to create a positive valence score and a negative valence score, respectively. We also quantified separately the percentage of anger-, fear-, anticipation-, trust-, surprise-, sadness-, joy- and disgust-associated words (association score ≥0.75, as defined by the NRC Emotion Lexicon108) of all words on each webpage visited by participants for each day and then across days (emotion scores; see Supplementary Analysis 1 for details).

We then related the psychopathology dimension scores to each affective score separately by submitting the three psychopathology dimension scores into a mixed ANOVA with psychopathology dimension as a within-subject factor, the valence score as a within-subject modulating covariate, and participants’ age and gender as between-subjects modulating covariates (similar to ref. 10). This analysis was followed up with a simplified analysis in which the average of the three psychopathology dimension scores for each individual was entered as the dependent measure in a linear regression, with valence, age and gender entered as independent measures. The data met assumptions for stated statistical tests.

For statistical analysis we used a combination of RStudio and SPSS. For text pre-processing and quantification, we used a combination of RStudio and Python (same for all studies).

Relating the valence of webpages to mood

To investigate the relationship between web-browsing patterns and mood, we asked participants to indicate their current mood directly before their web-browsing session and directly afterwards, on a slider scale from ‘very unhappy’ to ‘very happy’. Utilizing a continuous scale is a common method for evaluating happiness levels105,106,107. We first assessed whether participants’ pre-browsing mood was related to the valence of information they browsed. To that end, we ran two separate mixed-effect models, each including participants’ pre-browsing mood ratings (which we coded by converting the scale to −50 to +50) as fixed and random effects, along with age and gender as fixed effects, predicting the negative valence score and positive valence score of webpages visited, separately. Next, we were interested in whether the valence of the webpages that participants browsed had an impact on their mood directly after browsing the internet. To test this, we once again ran two mixed-effect models, each predicting post-browsing mood ratings (coded on the same −50 to +50 scale) from either the negative or positive valence score of webpages visited (input as a fixed and random effect), controlling for pre-browsing mood (fixed and random effect) as well as age and gender (fixed effects). The data met assumptions for stated statistical tests.

Study 2: replication of study 1

Participants

Five hundred participants were recruited through the Prolific platform. Sample sizes were determined based on a power analysis relating the negative score of web-browsing on day 1 from study 1 to mean psychopathology scores (G*Power89: α = 0.05, 1 − β = 0.95). The majority of participants were recruited from the UK and US, with a subset (n = 168) recruited from any country without restriction. Participants were 18 years or older. There were no other inclusion or exclusion criteria. Data for 53 participants from whom we could not obtain at least 1 KB of text from a minimum of three webpages a day were not analysed. Thus, data for 447 participants were analysed (age 33.85 years, s.d. 12.58; 56.4% females, 41.8% males, 1.8% other). For the mood analysis, we included only those participants who submitted data that was browsed during the study session (n = 400, age 33.23 years, s.d. 11.62; 52.4% females, 47.6% males, 0% other), because otherwise their reported mood ratings would not be temporally reflective of their submitted browsing data. Participants received £7.50 for their participation.

Procedure

Study 2 replicated the methodology of study 1 with two modifications. First, we required participants to engage in a 1-day, 30-minute internet browsing session. This decision was made after a post hoc analysis of study 1, aimed at achieving a balance between statistical rigour and resource efficiency, including cost-effectiveness. Our analysis suggested that involving approximately 500 participants for this single-day study could lead to substantial cost reductions. In addition, we slightly extended the data collection time frame to gather adequate data per participant. Second, in contrast to study 1, psychopathology questionnaires were administered post web-browsing to show that the sequence of tasks in study 1 did not influence its outcomes. Mood ratings were still recorded before and after the session.

Analysis

Relating the valence score of webpages to psychopathology

This analysis was conducted as described in study 1.

Relating the valence of webpages to mood

We first tested whether participants’ pre-browsing mood was related to the valence of information they browsed. Because we had only one observation per participant for each variable of interest (compared with five observations in study 1), we ran two simple linear regressions predicting the negative valence score and positive valence score, separately, from pre-browsing mood ratings, controlling for age and gender. Next, we were interested in whether the valence of the webpages that participants browsed had an impact on their mood directly after browsing the internet. To test this, we ran two simple linear regressions, both predicting participants’ post-browsing mood ratings from either the negative or positive valence score of webpages visited. Both models controlled for participants’ pre-browsing mood ratings, age and gender. The data met assumptions for stated statistical tests.
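For intuition, a ‘predict post-browsing mood from valence, controlling for pre-browsing mood’ regression can be illustrated via the Frisch–Waugh residualization trick: regress the outcome and the predictor of interest on the covariate, then regress the residuals on each other. The sketch below handles a single covariate (the study’s models additionally controlled for age and gender) and is an illustration, not the actual analysis code.

```python
from statistics import mean

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

def controlled_slope(y, x, covar):
    """Effect of x on y controlling for covar (Frisch-Waugh):
    residualize both y and x on the covariate, then regress residuals."""
    def resid(v, c):
        b = slope(c, v)
        mv, mc = mean(v), mean(c)
        return [vi - (mv + b * (ci - mc)) for vi, ci in zip(v, c)]
    return slope(resid(x, covar), resid(y, covar))

# Hypothetical data: post-mood built as 2*valence + 3*pre-mood exactly,
# so the controlled slope recovers 2.
post_mood = [2, 7, 6, 11]
neg_valence = [1, 2, 3, 4]
pre_mood = [0, 1, 0, 1]
b_valence = controlled_slope(post_mood, neg_valence, pre_mood)
```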

Study 3

Participants

One hundred and thirty-nine participants completed the study on Qualtrics (www.qualtrics.com) and were recruited via Prolific’s online recruitment platform (www.prolific.co). Sample sizes were determined based on a pilot study to achieve a power of 0.95 (α = 0.05), using G*Power89. Participants received £7.50 per hour for their participation. Participants were recruited from the UK and US and were 18 years or older. There were no other inclusion or exclusion criteria. Thirty-seven participants were excluded for not providing at least three webpages from which we could extract at least 1 KB of data, leaving 102 participants (negative valence condition: n = 55, age 33.96 years, s.d. 9.68; 45.5% females, 49.1% males, 5.5% other; control condition: n = 47, age 34.72 years, s.d. 12.14; 46.8% females, 51.1% males, 2.1% other). The two conditions were run within 35 minutes of each other. We did not actively randomize participants across conditions; rather, the recruitment advertisement was identical for both conditions. This means the participants were not aware which condition they were signing up for, nor were they aware that there were two conditions, ensuring no differences in demographics across groups (Supplementary Table 1). Stimuli presentation was random.

Procedure

Data collection

Here, we assess the directionality of the relationship between mood and web-browsing patterns. Participants were asked to browse two webpages randomly selected from a pool of either six very negative or six neutral webpages (all selected from pages participants browsed in study 1 or 2). The negative webpages were identified based on a negative score of >2.5 s.d. from the mean of webpages browsed in studies 1 and 2, whereas the neutral webpages had a negative score between −1 and +1 s.d. from this mean. The selection criteria for the valence of the webpages were consistent with the methodology outlined in study 1. Participants’ happiness levels were measured on a scale from ‘very unhappy’ to ‘very happy’, both before and after the webpage manipulation. Utilizing a continuous scale is a widely used method for evaluating happiness levels105,106,107.
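The webpage selection criterion above amounts to a small filter over standardized negative scores. The sketch below mirrors the >2.5 s.d. and ±1 s.d. cut-offs using hypothetical score values; it is illustrative only.

```python
from statistics import mean, pstdev

def classify_pages(neg_scores):
    """Label pages 'negative' (> 2.5 s.d. above the mean negative score),
    'neutral' (within +/-1 s.d. of the mean) or 'other', per the cut-offs
    described above. Scores here are hypothetical."""
    m, s = mean(neg_scores), pstdev(neg_scores)
    labels = []
    for v in neg_scores:
        z = (v - m) / s
        if z > 2.5:
            labels.append("negative")
        elif -1 <= z <= 1:
            labels.append("neutral")
        else:
            labels.append("other")
    return labels

# One extreme page among otherwise similar scores
labels = classify_pages([1, 1, 1, 1, 1, 1, 1, 1, 10])
```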

Next, participants were asked to browse the internet for 10 minutes using Mozilla Firefox and then submit their internet search history for this period. We then extracted the paragraph text from each webpage, denoted by <p> in the webpage’s html code, using the ‘rvest’ package in RStudio. All consecutive duplicate webpages were removed from analysis.

Analysis

To assess whether the mood manipulation was successful, we conducted a linear regression predicting participants’ post-browsing mood ratings from the condition variable (0, neutral condition; 1, negative condition), controlling for pre-browsing mood, age and gender. Next, for each participant, we computed the negative valence score of the webpages browsed. Finally, we conducted an independent samples t-test to investigate whether there was a difference in the negative valence score of the webpages browsed between the conditions. The data met assumptions for stated statistical tests.
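The group comparison at the end of this analysis is a standard independent-samples t-test. A minimal pooled-variance version is sketched below for reference; it returns only the t statistic (not the P value), and the study presumably used standard statistical software rather than hand-rolled code.

```python
from statistics import mean, variance
from math import sqrt

def independent_t(a, b):
    """Student's independent-samples t statistic with pooled variance."""
    na, nb = len(a), len(b)
    # Pooled variance across the two groups
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
```

For example, `independent_t([2, 3, 4], [1, 2, 3])` compares two small hypothetical groups of negative valence scores; identical groups yield t = 0.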

Study 4a

Participants

One hundred and nine participants (label condition: n = 55, age 36.94 years, s.d. 13.68; 67.7% females, 33.3% males, 0% other; no-label condition: n = 54, age 36.09 years, s.d. 9.97; 49.1% females, 47.3% males, 3.6% other) completed the study on Qualtrics (www.qualtrics.com) and were recruited via Prolific’s online recruitment platform (www.prolific.co). Sample sizes were determined based on a pilot study to achieve a power of 0.95 (α = 0.05), using G*Power89. Participants were recruited from the UK and US and were 18 years or older. There were no other inclusion or exclusion criteria. Participants received £7.50 per hour for their participation. Twenty-three participants were recruited 1 day before the main data collection to assure the negative mood manipulation was working, otherwise the two conditions were run within 2 hours of each other.

We did not actively randomize participants across conditions; rather, the recruitment advertisement was identical for both conditions. This means the participants were not aware which condition they were signing up for, nor were they aware there were two conditions, ensuring no differences in demographics across groups (Supplementary Table 1). Stimuli presentation was random.

Procedure

Data collection

Participants were presented with three trials, each including a different Google search results page containing one search query (randomly selected from a pool of 18 queries from Google’s list of frequent queries) along with three real Google search results for that specific query. For each query, the three search results were selected such that one led to a webpage whose text had a positive valence score (that is, >2.5 s.d. from the mean positive scores of webpages browsed in studies 1 and 2), one to a webpage with a negative score (>2.5 s.d. from the mean negative scores of webpages browsed in studies 1 and 2) and one to a webpage with a neutral score (<2.5 s.d. from the mean of positive and negative scores of webpages browsed in studies 1 and 2). The presentation order of webpages was randomized to control for order effects.

On each trial, participants selected one of the three search results offered for that specific query (we stress that all three results were on the same topic because they were actual result options for the same query) and then spent 90 seconds browsing the selected webpage. Participants were notified that they would be asked a question about the content they browsed, to ensure adherence to the task.

Participants were assigned to the no-label condition or the label condition. In the label condition, participants were presented with a three-point scale next to each search result option that went from ‘feel better’ to ‘feel worse’. An arrow indicated where on the scale that webpage scored (either ‘feel better’, ‘feel worse’ or between the two). The labels indicated whether, on average, the website makes people feel worse or better (Fig. 6a). In the no-label condition, participants were not presented with any emojis or labels next to the search results.

Analysis

To assess whether the intervention was successful, we conducted separate linear regressions, each predicting the mean number of positive, neutral or negative options selected per participant from the condition variable (0, no-label condition; 1, label condition), controlling for age and gender. The data met assumptions for stated statistical tests.

Study 4b

Participants

Two hundred participants (age 40.8 years, s.d. 12.9; 58.0% females, 40.5% males, 1.5% other) completed the study on Qualtrics (www.qualtrics.com) and were recruited via Prolific’s online recruitment platform (www.prolific.co). Sample sizes were determined based on a pilot study to achieve a power of 0.95 (α = 0.05), using G*Power89. Participants were recruited from the UK and US and were 18 years or older. There were no other inclusion or exclusion criteria. Participants received £9.00 per hour for their participation. Stimuli presentation was random.

Procedure

The task was exactly as in the study 4a label condition, except that participants: (1) indicated their mood at the beginning of the study (baseline) and then directly after each trial on a slider scale from ‘very unhappy’ to ‘very happy’; (2) completed six trials rather than three; and (3) topics were randomly selected from 14 rather than 18 search queries and results (trials) from study 4a (because four trials were not accessible).

Analysis

First, we conducted a paired samples t-test comparing the number of times participants selected the positive label compared with the neutral and negative label. Next, a linear mixed-effect model was implemented to analyse the impact of webpage choice on mood. Webpage choice, categorized as negative (−1), neutral (0) or positive (1), served as the independent variable (fixed and random effects). The model also controlled for baseline mood, age and gender as fixed effects to account for any confounding influences. This approach allowed us to isolate the direct effect of the independent variable—webpage choice—on the mood of the participants. The data met assumptions for stated statistical tests.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
