The Warren Center for Network and Data Sciences
We are delighted that a new Center is being launched to focus on topics closely aligned with the NETS Program. The Warren Center for Network and Data Sciences will include the faculty from the NETS Program as well as from across Penn, all focused on network science and data science topics. http://www.warrencenter.upenn.edu/
ASPEN - cluster processing of heterogeneous dynamic data
Managing Heterogeneity in Highly Distributed Stream, Cloud, and Sensor Systems
With the advent of low-cost wireless sensing devices, it is predicted that the world will quickly move to one in which many environments are instrumented for reasons of security, scientific monitoring, environmental control, entertainment, etc. There are many fundamental questions about how to develop applications in this emerging sensor network world. Perhaps the most important are how to support rich, complex applications that may have confidentiality requirements, heterogeneous types of sensors, different connectivity levels, and timing constraints. The Aspen (Abstract Sensor Programming Environment) project focuses on the challenges in developing a programming environment and runtime system for this style of environment.
We are investigating a number of complementary topics and ideas:
- Complex analysis in a cluster/cloud setting: Many sensor and stream data items need complex analysis. Building upon ideas from MapReduce and from our ORCHESTRA distributed query engine, we are developing new techniques for supporting cluster computation with incremental updates over recursive operations (e.g., PageRank, optimization, ...).
- Distributed coordination and control: Many complex computations need to be continuously rebalanced, redistributed, and replanned based on monitored activity -- this is a form of adaptive processing. We are developing new declarative techniques to address these problems.
- New programming model: group-based programming: We are building upon a declarative style of programming to develop a new language, group-based programming, for complex sensor applications. The goal is to combine compositional, database-style declarative computation with constraints on timing, security, distribution, and actuation in a seamless way. This work is funded by NSF CNS-0721541.
- Security and privacy: We have studied how sensor network application security is affected by node-level compromise. We are developing further language constructs for specifying encryption levels and other properties for data along certain channels.
- Runtime monitoring and checking: We seek to develop techniques for monitoring performance and triggering events in response to constraint violations. This work is funded by NSF CNS-0721541.
- Home health care and hospital applications: We hope to develop a number of applications useful in home hospice and hospital care that monitor patients and connect them with the care they need. This work is funded by NSF CNS-0721541.
- Declarative information integration and query optimization: The core programming model is based on database query languages. We are developing techniques for supporting schema mappings over streams, distributed in-network join computation, and recursive queries for regions. Importantly, we are developing techniques for performing distributed, decentralized optimization of such computations. This work is funded by NSF IIS-0713267.
- Stream algorithms: In a distributed setting, many nodes have limited resources and must use approximate algorithms to make decisions and capture synopses of system activity. This work is funded by NSF IIS-0713267.
- Interfacing to Java code: Many real control systems require Java, C, or other procedural code for sophisticated sensor data processing or decision-making. This work is funded by Lockheed Martin.
- Declarative monitoring and re-optimization: We seek to build a declarative infrastructure for monitoring distributed query execution status, plus adaptive re-optimization, using declarative techniques. This work is funded by Lockheed Martin.
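As a small illustration of the first item above, incrementally maintaining a recursive computation, the sketch below (not the actual ORCHESTRA/Aspen code; all names are invented) maintains graph reachability in the style of semi-naive Datalog evaluation: a full recomputation derives facts from scratch, while an incremental update propagates only the consequences of one new edge.

```python
# Illustrative sketch: incremental maintenance of a recursive computation
# (graph reachability), in the spirit of semi-naive Datalog evaluation.

def transitive_closure(edges):
    """Full recomputation: derive all reachable pairs from scratch."""
    closure = set(edges)
    delta = set(edges)
    while delta:
        # join only the newly derived pairs against the base edges
        derived = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
        delta = derived - closure
        closure |= delta
    return closure

def insert_edge(edges, closure, u, v):
    """Incremental update: propagate only the consequences of one new edge."""
    edges.add((u, v))
    froms = {a for (a, b) in closure if b == u} | {u}   # nodes reaching u
    tos = {d for (c, d) in closure if c == v} | {v}     # nodes reachable from v
    new = {(a, d) for a in froms for d in tos} - closure
    closure |= new
    return new

edges = {(1, 2), (2, 3)}
closure = transitive_closure(edges)   # {(1, 2), (2, 3), (1, 3)}
insert_edge(edges, closure, 3, 4)     # derives only (3, 4), (2, 4), (1, 4)
```

The incremental path touches only pairs involving the new edge, which is the kind of saving that matters when recursive results (e.g., rank vectors) are large and updates are frequent.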
Query-driven data integration
The Q Query System
One of the major challenges for end users today (whether scientists, researchers, policymakers, etc.) is how to pose integrative queries over Web data sources. Today one can fairly easily input a keyword search into Google (Bing, Yahoo, etc.) and receive satisfactory answers if one's information need matches that of someone in the past: the search engine will point at a Web page containing the already-assembled content.
The challenge arises when an information discovery query is posed -- one that requires assembling content from multiple data items and has not been asked before. The Q System attempts to provide an intuitive means of posing such queries.
In Q, the user first defines a web query form to answer queries related to a specific topic or topic domain: this is done by describing (using keywords) the set of concepts that need to be interrelated. The system finds sets of data sources related to each of the topics. Then, using automatic schema matching algorithms, it finds ways of combining source data items to return results.
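A toy rendition of that pipeline, keywords matched to sources, then sources combined on shared attributes, might look like the following (the source names, schemas, and data are all invented for illustration; real schema matching is far more sophisticated than the exact-attribute join used here):

```python
# Toy sketch of a Q-style pipeline; all sources and schemas are invented.
sources = {
    "gene_names": [{"gene_id": "G1", "symbol": "BRCA1"},
                   {"gene_id": "G2", "symbol": "TP53"}],
    "gene_papers": [{"gene_id": "G1", "pmid": "12345"},
                    {"gene_id": "G2", "pmid": "67890"}],
}

def attrs(name):
    return set(sources[name][0])

def find_sources(keyword):
    # step 1: relate a keyword to sources whose attribute names mention it
    return [n for n in sources if any(keyword in a for a in attrs(n))]

def join(s1, s2):
    # step 2: a crude stand-in for schema matching -- combine two sources
    # on whatever attributes their schemas share
    shared = attrs(s1) & attrs(s2)
    return [{**r1, **r2} for r1 in sources[s1] for r2 in sources[s2]
            if all(r1[a] == r2[a] for a in shared)]

# a "query form" relating the keywords "symbol" and "pmid"
candidates = [find_sources("symbol")[0], find_sources("pmid")[0]]
results = join(*candidates)   # pairs each gene symbol with its papers
```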
Putting Differential Privacy to Work
A wealth of data about individuals is constantly accumulating in various databases in the form of medical records, social network graphs, mobility traces in cellular networks, search logs, and movie ratings, to name only a few. There are many valuable uses for such datasets, but it is difficult to realize these uses while protecting privacy. Even when data collectors try to protect the privacy of their customers by releasing anonymized or aggregated data, this data often reveals much more information than intended. To reliably prevent such privacy violations, we need to replace the current ad hoc solutions with a principled data release mechanism that offers strong, provable privacy guarantees. Recent research on differential privacy has brought us a big step closer to achieving this goal. Differential privacy allows us to reason formally about what an adversary could learn from released data, while avoiding the need for many assumptions (e.g., about what an adversary might already know) whose failure has been the cause of privacy violations in the past. However, despite its great promise, differential privacy is still rarely used in practice. Proving that a given computation can be performed in a differentially private way requires substantial manual effort by experts in the field, which prevents it from scaling in practice.
This project aims to put differential privacy to work---to build a system that supports differentially private data analysis, can be used by the average programmer, and is general enough to be used in a wide variety of applications. Such a system could be used pervasively and make strong privacy guarantees a standard feature wherever sensitive data is being released or analyzed. The long-term goal is to combine ideas from differential privacy, programming languages, and distributed systems to make data analysis techniques with strong, provable privacy guarantees practical for general use.
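The basic building block such a system automates is well known: perturb a query's true answer with noise calibrated to the query's sensitivity. A minimal sketch of the standard Laplace mechanism for a counting query (illustrative only, not this project's system):

```python
import random

def laplace_noise(scale):
    # the difference of two i.i.d. exponentials is Laplace(0, scale);
    # random.expovariate takes the rate, i.e. 1/scale
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one person's
    # record changes the true count by at most 1, so Laplace(1/epsilon)
    # noise yields epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 45, 67, 34, 71, 52]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
```

The expert effort the project targets lies in what this sketch hides: bounding the sensitivity of arbitrary programs and tracking the privacy budget across many queries.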
Secure network provenance
Operators of distributed systems often find themselves needing to answer a diagnostic or forensic question. Some part of the system is found to be in an unexpected state; for example, a suspicious routing table entry is discovered, or a proxy cache is found to contain an unusually large number of advertisements. The operators must determine the causes of this state before they can decide on an appropriate response. On the one hand, there may be an innocent explanation: the routing table entry could be the result of a misconfiguration, and the cache entries could have appeared due to a workload change. On the other hand, the unexpected state may be the symptom of an ongoing attack: the routing table entry could be the result of route hijacking, and the cache entries could be a side-effect of a malware infection. In this situation, it would be helpful to be able to ask the system to "explain" its own state, e.g., by describing a chain of events that link the state to its root causes, such as external inputs.
As long as the system is working correctly, emerging network provenance techniques can construct such explanations. However, if some of the nodes are faulty or have been compromised by an adversary, the situation is complicated by the fact that the adversary can cause the nodes under his control to lie, suppress information, tamper with existing data, or report nonexistent events. This can cause the provenance system to turn from an advantage into a liability: its answers may cause operators to stop investigating an ongoing attack because everything looks fine.
The goal of this project is to provide secure network provenance, that is, the ability to correctly explain system states even when (and especially when) the system is faulty or under attack. Towards this goal, we are substantially extending and generalizing the concept of network provenance by adding capabilities needed in a forensic setting, we are developing techniques for securely storing provenance without trusted components, and we are designing methods for efficiently querying secure provenance. We are evaluating our techniques in the context of concrete applications, such as Hadoop MapReduce or BGP interdomain routing.
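In provenance terms, "explaining" a state means walking its derivation graph back to external inputs. A minimal sketch of that traversal (the routing example below is made up; a real provenance system must also defend the graph itself against tampering, which is exactly the hard part this project addresses):

```python
# Hypothetical derivation graph: each derived state maps to its direct
# causes; states with no entry are external inputs (root causes).
provenance = {
    "route(a->d via b)": ["advert(b, d)", "policy(a, prefer b)"],
    "advert(b, d)": ["route(b->d via c)"],
    "route(b->d via c)": ["advert(c, d)"],
}

def explain(state):
    """Recursively trace a state back to its root causes."""
    causes = provenance.get(state)
    if not causes:
        return {state}  # external input: a root cause
    roots = set()
    for c in causes:
        roots |= explain(c)
    return roots

explain("route(a->d via b)")
# -> {"advert(c, d)", "policy(a, prefer b)"}
```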
Next-generation network science
Office of Naval Research: Multi-University Research Initiative
Net-centric technology promises unprecedented levels of performance, robustness, and efficiency, yet substantial confusion remains about the obstacles to achieving this vision. Our program will clarify and address the central research challenges: understanding network structure and function, domain-specific drivers and constraints, and the crucial issue of network architecture. Networks must shift from mere sensing and communication to real-time, dynamic decision and control, with algorithms that are primarily local yet achieve provably global results, all while avoiding the rare but catastrophic real-world failures that are hard to anticipate with simulation-based methods. Our team consists of experts in physical, informational, control, social, and cognitive networks who have played major leadership roles in creating an international research community that is developing a framework and methodology for posing and answering questions directly relevant to this project.
Accountability in distributed systems
Evidence in Federated Distributed Systems
There is an increasing trend towards federated distributed systems, i.e., systems that are operated jointly by multiple different organizations or individuals. The interests of the participants in such a system are often highly diverse and/or in conflict with one another; for example, participants may be business competitors or based in hostile nations. Thus, federated systems are inherently vulnerable to insider attacks: the participants can try to subvert the system, exploit it for their own benefit, or attack other participants.
However, the participants in a federated system are typically connected in the 'offline world' as well, e.g., through social networks or business relationships. This context can be leveraged to handle misbehavior through well-known, time-tested techniques like accountability and transparency. For example, if one participant can detect and prove that another participant has misbehaved, she can sue that participant for breach of contract.
The goal of this project is to develop a key technology for enabling this approach, namely a reliable and general way to generate and verify evidence of misbehavior in federated systems. We study the fundamental tradeoffs, requirements, and inherent costs of creating evidence, we develop new algorithms for efficiently supporting different kinds of evidence, and we evaluate these algorithms in the context of practical systems.
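One classic building block for such evidence is a tamper-evident log: each entry is chained to its predecessor's hash, so a participant cannot later rewrite history without breaking a link an auditor can check. A minimal sketch (signatures, which bind the log to its author and make the evidence non-repudiable, are omitted for brevity):

```python
import hashlib

GENESIS = "0" * 64

def append(log, entry):
    """Chain each entry to the hash of the one before it."""
    prev = log[-1][0] if log else GENESIS
    digest = hashlib.sha256((prev + entry).encode()).hexdigest()
    log.append((digest, entry))

def verify(log):
    """An auditor replays the chain; any tampering breaks a hash link."""
    prev = GENESIS
    for digest, entry in log:
        if digest != hashlib.sha256((prev + entry).encode()).hexdigest():
            return False
        prev = digest
    return True

log = []
for msg in ["send(m1, to=B)", "recv(m2, from=C)", "send(m3, to=B)"]:
    append(log, msg)
verify(log)   # True for the untampered log
```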
Safety on Untrusted Network Devices
The goal of the SOUND (Safety On Untrusted Network Devices) project is to design a distributed system that can offer cloud-style services but is highly resilient to cyber-attacks. Rather than focusing on specific known attacks, we would like to provide resiliency against a broad range of known and unknown (Byzantine) attacks; for instance, an adversary could compromise a certain number of nodes and modify them in some arbitrary way. Our goal is to detect and mitigate such attacks whenever possible, e.g., by reconfiguring the system to exclude any compromised nodes.
We approach this problem using the principle of mutual suspicion: Nodes continually monitor each other and check for unusual actions or changes in behavior that could be related to an attack. However, since we are assuming a very strong adversary, the bar for a successful solution is high: We require a strong, provable guarantee that the adversary cannot circumvent the system, as well as a practical design that can efficiently provide this guarantee. We expect that the SOUND project will build on results from the CRASH/SAFE effort at the level of individual nodes; however, SOUND goes beyond CRASH/SAFE by considering an entire distributed system with a heterogeneous mix of nodes, many of which may not be operating in a secure environment.
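The mutual-suspicion principle can be sketched as replay auditing, in the spirit of accountability systems such as PeerReview (the counter service below is an invented example): if a node's behavior is deterministic, a peer can re-execute the node's logged inputs against a reference implementation and treat any divergence as evidence of compromise.

```python
def step(state, request):
    """Deterministic reference behavior: a toy counter service."""
    new_state = state + request
    return new_state, new_state  # (next state, reply sent back)

def audit(log, initial_state=0):
    """Replay a peer's logged (request, reply) pairs against the reference."""
    state = initial_state
    for request, claimed_reply in log:
        state, expected_reply = step(state, request)
        if expected_reply != claimed_reply:
            return False  # the peer's reply deviates from correct behavior
    return True

honest_log = [(5, 5), (3, 8), (2, 10)]
tampered_log = [(5, 5), (3, 9), (2, 10)]  # second reply was altered
```

In a real deployment the logs themselves must be tamper-evident and signed, and the reference implementation must run on hardened nodes, which is where the connection to CRASH/SAFE comes in.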