Enhance Privacy Policy Communication with Automated Information Extraction
Problem Summary
The complexity of privacy policies often makes it difficult for users to understand data practices and their implications. Traditional privacy policies are usually lengthy, dense, and filled with legal jargon, leading to a lack of user engagement and comprehension.
Rationale
Simplifying and structuring privacy policies improves their clarity and comprehensibility. This helps users understand the information more easily, make informed decisions, and avoid the cognitive burden associated with lengthy and complex policies.
Solution
Develop or adopt automated tools that improve the querying and readability of privacy policies, enhancing users' understanding and awareness of data practices. These tools primarily focus on annotating privacy policies to facilitate subsequent processing. Because this annotation process is complex, they employ natural language processing (NLP) and machine learning (ML) techniques to automate it. Most of the models in the supporting research presented below were trained on the OPP-115 dataset.
Harkous et al. [1] proposed Polisis, a scalable tool for privacy policy analysis that breaks privacy policies into segments and annotates each for detailed data practices. This allows for both high-level and fine-grained queries.
Polisis consists of three layers: (a) Application Layer - provides information queried by users; (b) Data Layer - responsible for scraping the policy webpage, extracting, and segmenting the privacy policy; and (c) Machine Learning Layer - utilises word embeddings and neural networks for text classification, trained on the OPP-115 dataset to detect fine-grained labels of privacy policy segments.
The prototype browser extensions for Chrome and Firefox and the prototype webpage (https://pribot.org/) are no longer operational.
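The three layers described above can be sketched in a few lines. This is a minimal illustration, not the Polisis implementation: the keyword lists are a hypothetical stand-in for the neural classifiers, the function names are invented, and only the category names are taken from OPP-115.

```python
import re

# Hypothetical keyword lists standing in for the trained neural classifiers.
# The category names follow OPP-115; the keywords are illustrative only.
KEYWORDS = {
    "First Party Collection/Use": ["we collect", "we use"],
    "Third Party Sharing/Collection": ["third part", "share"],
    "Data Retention": ["retain", "retention"],
}


def segment_policy(text):
    """Data layer: split raw policy text into paragraph-level segments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def classify_segment(segment):
    """ML layer stand-in: return every category whose keywords occur."""
    lowered = segment.lower()
    return {cat for cat, words in KEYWORDS.items()
            if any(w in lowered for w in words)}


def query(text, category):
    """Application layer: return the segments annotated with a category."""
    return [s for s in segment_policy(text) if category in classify_segment(s)]


policy = """We collect your email address when you register.

We may share usage data with third parties for advertising.

We retain account data for two years."""

for seg in query(policy, "Third Party Sharing/Collection"):
    print(seg)
```

Because each segment can receive several labels, the same annotated segments can serve both a high-level query (one category) and a finer-grained one (a combination of categories).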
Brunotte et al. [2] proposed the Online Privacy Policy eXplainer (PriX), a web browser extension designed to help users understand privacy policies through visual explanations. PriX performs three key functions: checking if a website has a privacy policy, analysing the privacy policy, and providing visual explanations to enhance user comprehension.
The tool uses trained classifiers (Naive Bayes and Random Forest) to identify data categories and presents them with corresponding privacy icons to aid understanding. The classifiers were trained on the OPP-115 dataset.
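To make the classify-then-show-an-icon idea concrete, the sketch below trains a tiny multinomial Naive Bayes model from scratch and maps the predicted category to an icon. Everything in it is an assumption for illustration: the four training sentences, the two category names, and the icon placeholders are invented, and PriX's actual features and icon set are not reproduced.

```python
import math
from collections import Counter, defaultdict

# Toy training data invented for illustration; PriX was trained on OPP-115.
TRAIN = [
    ("we collect your email address and phone number", "Contact"),
    ("your email address is used to contact you", "Contact"),
    ("we record your gps location and nearby wifi", "Location"),
    ("location data from your device gps is stored", "Location"),
]

# Hypothetical icon placeholders, not the real PriX icons.
ICONS = {"Contact": "[envelope]", "Location": "[map pin]"}


def train(samples):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
label = predict(model, "We store your GPS location")
print(label, ICONS[label])
```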
Shayegh and Ghanavati [3] proposed a method for extracting notice and choice statements from privacy policies.
Privacy policies are annotated with five key labels: #D (definition), #A (action), #F (fact), #NR (non-relevant), and #CR (cross-references). Actions are further detailed with sub-annotations such as #Collect, #Purpose, #Shared, #Permit, and #Information.
These annotations are used to create graphs from the annotated action sections, establishing connections between different tags. The graphs are then transformed into concise notices and choices regarding data practices.
This method aims to generate short, understandable notices and choices, improving the readability and usability of privacy policies for users, particularly in IoT devices.
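The step from annotations to a graph to a short notice can be sketched as follows. This is a loose illustration under stated assumptions: the tag names come from the annotation scheme above, but the `actor` field, the edge layout, the sentence templates, and the thermostat example are all hypothetical and not taken from the paper.

```python
# Hypothetical annotated action. The "#..." keys reuse the sub-annotation
# tags; "actor" and all text spans are invented for this example.
action = {
    "actor": "The thermostat",
    "#Information": "room temperature readings",
    "#Purpose": "optimise heating schedules",
    "#Shared": "the energy provider",
    "#Permit": "disable sharing in the device settings",
}

# Assumed sentence templates, one per edge label.
TEMPLATES = {
    "#Collect": "{src} collects {dst}.",
    "#Purpose": "Purpose: {dst}.",
    "#Shared": "Shared with: {dst}.",
    "#Permit": "Your choice: {dst}.",
}


def build_graph(annotated):
    """Return labelled edges (source, tag, target) for one annotated action."""
    info = annotated["#Information"]
    edges = [(annotated["actor"], "#Collect", info)]
    for tag in ("#Purpose", "#Shared", "#Permit"):
        if tag in annotated:  # optional tags simply add no edge when absent
            edges.append((info, tag, annotated[tag]))
    return edges


def to_notice(edges):
    """Walk the edges in order and render each one with its template."""
    return " ".join(TEMPLATES[tag].format(src=src, dst=dst)
                    for src, tag, dst in edges)


print(to_notice(build_graph(action)))
```

Keeping the graph as plain labelled edges means a missing tag (for example, no `#Shared` span) shortens the notice instead of breaking it.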
Alabduljabbar et al. [4] developed TLDR, a segment-highlighting tool that condenses essential segments in privacy policies related to specific privacy practices. TLDR employs machine learning and deep learning models to extract privacy and data collection practices. The process involves segment preprocessing, extracting deep features from text representation techniques, and using an ensemble of classifiers for multi-label classification. The classifiers were trained using the OPP-115 dataset to detect fine-grained labels of privacy policy segments. Although the work does not include screenshots of the segments' presentation, it provides detailed performance metrics for the classifiers and machine learning algorithms used in TLDR's construction.
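An ensemble for multi-label classification can be illustrated with simple per-label majority voting. This is only a sketch: majority voting is an assumed combination rule, the three base classifiers are hypothetical keyword rules rather than trained models, and the category names are borrowed from OPP-115.

```python
from collections import Counter


# Three hypothetical base classifiers written as keyword rules so the
# example runs without trained models. TLDR itself uses learned models.
def clf_a(text):
    return {"Data Retention"} if "retain" in text else set()


def clf_b(text):
    labels = set()
    if "retain" in text or "store" in text:
        labels.add("Data Retention")
    if "encrypt" in text:
        labels.add("Data Security")
    return labels


def clf_c(text):
    return {"Data Security"} if "secure" in text or "encrypt" in text else set()


def ensemble_predict(text, classifiers):
    """Keep each label that a strict majority of classifiers voted for."""
    votes = Counter(label for clf in classifiers for label in clf(text.lower()))
    needed = len(classifiers) // 2 + 1
    return {label for label, n in votes.items() if n >= needed}


segment = "We retain your data for one year and encrypt it at rest."
print(sorted(ensemble_predict(segment, [clf_a, clf_b, clf_c])))
```

Voting per label, rather than per segment, is what allows one segment to end up with several labels or with none.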
Windl et al. [5] developed PrivacyInjector, a web browser extension for Google Chrome and Firefox designed to enhance user privacy awareness by providing contextual privacy policy information. PrivacyInjector operates in five steps: identifying the URL of the site's privacy policy, segmenting the policy, annotating the segments, recognising relevant contexts, and displaying the annotated segments as icon bubbles on the webpage. The tool uses domain-specific word embeddings created from unlabelled data and trains text classifiers using convolutional neural networks (CNNs). The segment classification process involves preprocessing, extracting deep features from text representation techniques, and utilising an ensemble of classifiers for multi-label classification. PrivacyInjector leverages the MAPS Policies Dataset to form a corpus of privacy-related words and uses fastText to train the word embeddings. The contextual privacy policy (CPP) design was developed in collaboration with design experts and presents concise information in a sidebar, with draggable bubbles appearing in relevant areas of the website. The extension's client-side and server-side components are available on GitHub, allowing for further research and development. The PrivacyInjector authors have also made the trained classifiers available in the repository https://github.com/Maxikilliane/polisis-classifiers, which uses results from the work of Harkous et al. [1].
Chang et al. [6] propose a system that builds a user privacy concern profile using crowdsourced data and interviews. These profiles are grouped using hierarchical clustering to create a system that matches new users to a profile cluster.
The system employs convolutional neural network (CNN) and Random Forest models to analyse privacy policies, taking into account the user's privacy concern profile and related GDPR items. The OPP-115 dataset provides both the training data and the privacy categories. The system achieves a precision of 0.94 for privacy category classification and an accuracy of 0.81 for policy segment extraction. Although the paper provides neither a link to the new privacy concern profiles dataset nor screenshots of the implemented Android app, it demonstrates the system's effectiveness in improving user privacy awareness.
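Grouping concern profiles hierarchically and matching a new user to a group can be sketched as below. The sketch rests on several assumptions: the profile vectors are invented, centroid linkage with Euclidean distance is chosen for brevity (the linkage used in the paper is not stated here), and the function names are hypothetical.

```python
import math

# Invented concern profiles: each vector rates concern (0 to 1) about
# collection, sharing, and retention. Real profiles came from crowdsourcing.
profiles = [
    [0.9, 0.8, 0.9],
    [0.8, 0.9, 0.9],
    [0.2, 0.1, 0.3],
    [0.1, 0.2, 0.2],
]


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(math.dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def match(user, clusters):
    """Return the index of the cluster whose centroid is nearest the user."""
    return min(range(len(clusters)),
               key=lambda c: math.dist(user, centroid(clusters[c])))


clusters = agglomerate(profiles, k=2)
print(match([0.7, 0.9, 0.8], clusters))
```

Once a user is matched to a cluster, the policy analysis can prioritise the segments that this cluster's members cared about most.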
Pontes, Zorzo and Mello [7] developed PPMark, a prototype tool designed to process privacy policies written in natural language and extract information about data collection and usage, presenting this information in a label format similar to nutrition facts. PPMark aims to make privacy policies more understandable by displaying key data collection practices in a user-friendly manner.
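A nutrition-facts-style presentation can be approximated with a small fixed-width table. This sketch only covers the display step and assumes the extraction has already happened: the field names, the three example rows, and the layout are hypothetical and do not reflect PPMark's actual output.

```python
# Hypothetical extraction result. PPMark's real output schema is not
# described in this guideline, so these field names are assumptions.
practices = [
    {"data": "Email address", "collected": True, "shared": False},
    {"data": "Location", "collected": True, "shared": True},
    {"data": "Contacts", "collected": False, "shared": False},
]


def render_label(rows):
    """Format practices as a fixed-width, nutrition-facts-style table."""
    mark = {True: "yes", False: "no"}
    lines = [
        "Privacy Facts",
        "=" * 36,
        f"{'Data type':<16}{'Collected':<12}{'Shared':<8}",
        "-" * 36,
    ]
    for row in rows:
        lines.append(f"{row['data']:<16}{mark[row['collected']]:<12}"
                     f"{mark[row['shared']]:<8}")
    return "\n".join(lines)


print(render_label(practices))
```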
Platforms: personal computers, mobile devices
Related guidelines: Implement Visual Strategies for Effective Communication of Lengthy Privacy Policies, Incorporate Icons to Improve Privacy Policy Communication
Example
The user interface of Privacy Policy eXplainer (PriX) [2] from Wayback Machine.
Generated notice example [4].
Overview of PrivacyInjector as in Windl et al. [5]. In the example, a user navigated to the homepage of Webex. PrivacyInjector identified segments related to cookies and tracking elements in the lengthy policy document (c). An information icon (a) appears on the cookie banner, and when selected, a sidebar (b) reveals the extracted info snippets.
Use cases
- Simplifying complex legal language and structuring privacy policies to enhance user comprehension and readability.
- Automatically generating concise notices and choices from longer privacy policies to help users quickly understand key data practices and make informed decisions.
- Assisting users in querying privacy policies to address specific privacy concerns, providing clear and relevant information about data handling practices.
- Helping websites and organisations comply with data protection regulations (e.g., GDPR) by making privacy policies more transparent, clear, and understandable for users.
Pros
- A user study showed that visual explanations are an appropriate way to foster privacy awareness and can help users understand privacy policies [2].
- TLDR reduces the average reading time by 39.14% by condensing information: with fewer paragraphs and words to read, users need less effort to grasp the service provider's practices [4].
- PrivacyInjector effectively recognised privacy policies across various websites, earning high functionality ratings from participants and encouraging more thoughtful privacy behaviours. Users found the tool increasingly useful over time, noting that it did not disrupt their browsing experience but provided valuable information beyond the initial exploration phase, particularly during registration/login or when surprising information was revealed [5].
- A real-world study of the proposed solution demonstrated its effectiveness, presenting users with the content that concerned them at an accuracy of 0.81 [6].
- Extracting notice and choice statements improves the clarity and precision of privacy policy summaries [3]. Also, PPMark's label format presentation of privacy policy information is user-friendly, making complex privacy terms more understandable, and users felt it reduced the time required to read privacy policies [7].
- Polisis allows detailed queries on data practices, significantly improving users' ability to understand complex privacy policies through machine learning-driven analysis [1].
Cons
- Long-term studies are needed to test user habituation [2][5].
- Service providers need to improve the machine-readability of privacy policies [6].
- PrivacyInjector currently identifies and analyses only English privacy policies, even on multi-language websites, so the English version is displayed in the context of any language version [5]. The approaches in [2] and [3] are likewise limited to English.
- Unusual website interactions, unclear policy URLs, mandatory user interactions, and missing policy links make it challenging to analyse and communicate policy segments systematically [5]. Additionally, the lack of a standard format in privacy policies leads to ambiguity [4]. The PPMark tool, for example, relies on structured input formats and may struggle with unstructured or poorly formatted privacy policies, limiting its effectiveness [7].
- A general technical challenge for privacy explanations is ensuring that the simplified texts remain legally compliant [2].
- The proposed annotation of privacy policies is performed manually, so a tool is still needed to automate these steps [3]. Additionally, the performance of machine learning models in accurately classifying and annotating segments of privacy policies may vary depending on the quality and representativeness of the training data [1].
Privacy Notices
Such solutions aim to communicate personal data handling practices through privacy notices. They can also be integrated with privacy choices [9], enabling users to make immediate decisions, which researchers find more effective. Considering the design space for privacy notices [8], this guideline can be applied to the following dimensions:
- On demand
The proposed guideline, aside from navigating the privacy policy itself, can also be utilised to present a privacy notice to users when they actively seek privacy information, such as in privacy dashboards and settings interfaces.
- At Setup
The proposed guideline can be used to present a privacy notice to users when they are using the system for the first time so they can be aware of the data handling practices. It can be integrated with privacy choices, requiring users to make decisions or give consent based on the information in the notice.
- Blocking
This guideline can be paired with blocking controls (privacy choices), requiring users to make decisions or give consent based on the information in the notice.
- Decoupled
This guideline can be applied to privacy notices decoupled from privacy choices.
- Non-blocking
This guideline can be coupled with non-blocking controls (privacy choices), providing control options without forcing user interaction.
- Visual
This guideline is for a visual notice, using visual resources such as colours, text and icons.
- Machine-readable
The suggested solutions could be applied to a machine-readable format, but currently, the lack of a standard format for privacy policies limits this potential.
- Primary
This guideline can be applied to the same platform or device the user is interacting with.
- Secondary
This guideline can be applied to secondary channels if the primary channel does not have an interface or has a limited one.
- Public
This guideline could be applied to public notices. However, public channels may be limited in how much information they can convey, and if privacy choices are required, other supporting channels are needed.
Transparency
Transparency [10] is the main privacy attribute since this mechanism involves the proactive distribution of information to users, promoting visually accessible communication of data handling practices, and helping users to make privacy-informed decisions. Other related privacy attributes:
Providing users with comprehensive and comprehensible insights into data handling practices leverages control by allowing users to make self-determined decisions about the sharing of their personal data.
This guideline can also improve the understanding of the purpose of data handling practices by enhancing querying and readability.
This guideline can also improve the understanding of personal data collection practices by enhancing the querying and readability of privacy policies.
References
[1] Hamza Harkous, Kassem Fawaz, Rémi Lebret, Florian Schaub, Kang G. Shin, and Karl Aberer. Polisis: Automated analysis and presentation of privacy policies using deep learning. In 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 531-548 https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-harkous.pdf
[2] Wasja Brunotte, Larissa Chazette, Lukas Kohler, Jil Klunder, and Kurt Schneider. What About My Privacy? Helping Users Understand Online Privacy Policies. In Proceedings of the International Conference on Software and System Processes and International Conference on Global Software Engineering (ICSSP'22). Association for Computing Machinery, New York, NY, USA, 2022, 56–65. https://doi.org/10.1145/3529320.3529327
[3] Parvaneh Shayegh and Sepideh Ghanavati. Toward an Approach to Privacy Notices in IoT. 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), Lisbon, Portugal, 2017, pp. 104-110. https://doi.org/10.1109/REW.2017.77
[4] Abdulrahman Alabduljabbar, Ahmed Abusnaina, Ülkü Meteriz-Yildiran, and David Mohaisen. TLDR: Deep Learning-Based Automated Privacy Policy Annotation with Key Policy Highlights. In Proceedings of the 20th Workshop on Privacy in the Electronic Society (WPES '21). Association for Computing Machinery, New York, NY, USA, 2021, 103–118. https://doi.org/10.1145/3463676.3485608
[5] Maximiliane Windl, Niels Henze, Albrecht Schmidt, and Sebastian S. Feger. Automating Contextual Privacy Policies: Design and Evaluation of a Production Tool for Digital Consumer Privacy Awareness. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22). Association for Computing Machinery, New York, NY, USA, 2022, Article 34, 1–18 https://doi.org/10.1145/3491102.3517688
[6] Cheng Chang, Huaxin Li, Yichi Zhang, Suguo Du, Hui Cao, and Zhu Haogin. Automated and Personalized Privacy Policy Extraction Under GDPR Consideration. In: Biagioni, E., Zheng, Y., Cheng, S. (eds) Wireless Algorithms, Systems, and Applications. WASA 2019. Lecture Notes in Computer Science, vol 11604. Springer, Cham. https://doi.org/10.1007/978-3-030-23597-0_4
[7] Diego Roberto Gonçalves Pontes, Sergio Donizetti Zorzo, and Jose Santiago Moreira de Mello (2017). Evaluation of the reliability of using the prototype PPMark - a tool to support the computer human interaction in readings the privacy policies - using the GQM and TAM models. AMCIS 2017 Proceedings. 22. https://aisel.aisnet.org/amcis2017/InformationSystems/Presentations/22
[8] Florian Schaub, Rebecca Balebako, Adam L Durity, and Lorrie Faith Cranor (2015). A Design Space for Effective Privacy Notices. In: Symposium on Usable Privacy and Security (SOUPS 2015). [S.l.: s.n.], p. 1–17. https://www.usenix.org/system/files/conference/soups2015/soups15-paper-schaub.pdf
[9] Yuanyuan Feng, Yaxing Yao, and Norman Sadeh (2021). A Design Space for Privacy Choices: Towards Meaningful Privacy Control in the Internet of Things. In CHI Conference on Human Factors in Computing Systems (CHI ’21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3411764.3445148
[10] Susanne Barth, Dan Ionita, and Pieter Hartel (2022). Understanding Online Privacy — A Systematic Review of Privacy Visualizations and Privacy by Design Guidelines. ACM Comput. Surv. 55, 3, Article 63 (February 2022), 37 pages. https://doi.org/10.1145/3502288