
Data Roadmap

Scott Thorne
Mary Weisse

1/26/11   1  
Agenda
• Current state and background
• Issues and problems
• Approach
• Next steps
  – issues to be addressed
  – issues deferred

MIT Data Warehouse - Overview
• Most administrative information is available
• Widely used and relied on
• Recognized by peer institutions as one of the best in higher education
• Improvements can make it more useful
  Examples:
  – adding more data
  – cleaning up data
  – real-time data
  – true 24x7 availability

Why the Warehouse was Created
• Primary reporting platform
  – provide a common access mechanism for all Institute-wide administrative data
  – easier access provided by a simpler view of data
• Support departmental reporting needs
  – ability to combine central and local data
  – reduce rekeying of data
• Improve data quality through usage
• Self service
  – minimize support & operational costs
  – DLCs don’t need to rely on a central reporting group

What Data is Available
• Academic Chairs
• Alumni Biographic, Occupation & Degree
• Association of Student Activities Group, Requests and Space
• Benefits
• Budgets (Nimbus)
• CIP (Classification of Instructional Programs)
• Concur Travel
• Coop Textbook
• Course Catalog
• Cybersource MIT Storefront transactions
• Department Master Hierarchy (and other hierarchies)
• EHS Contact, Room set, Sarah, Training Data
• Facilities/Plant Maintenance Orders
• Fidelity – Benefits
• Financial
• Financial Aid
• Fund data (Resource Development)
• Graduate Admissions
• Graduate Awards
• HR (Employees, Positions, & Appointments) Cyborg & SAP
• IAP Credit & Non-Credit Activities
• Instructor
• Institutional Research - Honors & Awards
• Kerberos
• Keys
• Libraries (Financials, Orders, Circulations, Collection Management, Serials Control Processing)
• Metadata (Table, view, & field definitions, data models; report templates)
• MIT Card – Pictures
• MIT ID
• Moira List data

What Data is Available (cont.)
• OSP Proposals and Award Cost Sharing
• Overhead Rates
• Payroll (Deductions, Distributions & Certifications)
• Pension
• Person
• Pre-registration & Registration
• Procurement (Purchasing, Requisition & Credit Card)
• Request Tracker
• Roles / Authorization
• Space
• Student Biographic
• Student Degree
• Subject Enrollment
• Student Grades
• Training & Events
• TLO
• Telephone Detail (ice 9)
• Student Term Enrollment
• Undergrad Admissions
• Vacation/Absences/Leaves
• Web Statistical Data

Context Diagram

[Diagram with three columns: Source Systems; Warehouse Environment; User Access. Labels: Extract; Files; Extract, Transform & Load; DBMS; Metadata; Web Reporting Tool; SQL; Brio; User Pull Files]

Issues and Problem Collection
• Who
  – Data Warehouse Community Survey
  – Task Force Reports
  – Reporting Tool Review Committee
  – DW Staff
  – Interview data users
    • Institutional, School, Department
    • Central, Research, Academic
• What
  – Document
    • specific issues
    • brief analysis
  – Identify next steps

Categories of Data Issues
• Warehouse infrastructure & operations
• Reporting environment
• Missing data
• Data access policy
• Strategic reporting (hard to answer questions)

Warehouse Infrastructure & Operations
Issues identified by the Warehouse Staff
• Job scheduling
  – start time manually set
  – extra time used to pad
  – hard to maintain
• 7x24 availability
• Congested nightly loads
  – scheduling
  – full/incremental
  – a few intensive jobs

Warehouse Infrastructure & Operations
• Performance of large detailed tables as history grows
• Managing the daily operation
  – reduce the number of non-standard processes
• Incomplete documentation
• Lack of real-time data feeds
• Lack of a process to build and deliver higher-level information

Reporting Environment Issues
• Data definitions need improvement
• Need intuitive table and field names
• Report descriptions aren't sufficient
  – hard to find the one you need
  – report names should be in the footer, so once you have a printout you can find the report used to produce it
  – organize reports better
  – easy to identify reports that need to be run by DLC monthly
• Easier way to get the data from the Warehouse and format in Excel
• Easy way to join data from multiple domains

Data Missing from Warehouse
• OSP
  – Pre-Proposal
  – Award Budgets
• Property
• TLO – Basic
• Alumni
• ISO information
• Travel charges that haven't been submitted on a trip report
• Travel Guest Card charges not available for DLCs
• Historical Request Tracker Information
• Profit Center Group History
• Building Data – Date Built and Date Renovated
• Space Survey Data
• Building Depreciation

Data Access Policy Issues
• Room usage type
• Appointment supplements
• HR information for people affiliated with labs & centers
  – paid but not appointed
  – faculty not charged or appointed

Strategic Reporting (Hard to answer questions)
• What does it cost MIT to deliver its programs?
  – Undergraduate Education, Graduate Education and Research
  – How do costs vary by school and program?
  – How do education costs relate to tuition charged (sticker price) and tuition collected (net tuition revenue)?
  – How do research costs relate to research revenues collected (direct and indirect)?
• What does it cost MIT to own and maintain its space?
  – capital (interest, depreciation)
  – operational (utilities, repair and maintenance costs)
  – How do costs vary by building and relate to market rates?

Strategic Reporting (Hard to answer questions)
• How much money has MIT received from all sources (gifts, research revenues, other) in support of major areas such as Cancer Research or Energy compared to total resources used and needed?
• What are MIT’s gross expenses by function before internal charges?

State of Data Practices
• Current software project development does not always include
  – reporting requirements
  – conceptual data model
  – data definitions
• Projects often focus on a transactional orientation
  – not on the ways the information might be used in the future

Reporting Committee Findings
• IBM/COGNOS Tool Suite meets user requirements for functionality and look & feel.
  – Handling of local data
  – Web-based solution
• Data issues will not be resolved by a tool
• Rollout of the tool will be a multiyear effort
• Data issues should be addressed as the tool is rolled out
• Requires the involvement of the Business Owners and Community

Approach
• Demonstrate to the community that we are working on all aspects of the problem
• Create a repeatable process that can be applied per subject
• Gain experience with the new tool
• Begin engaging Business Owners
• Partner with Development Projects
• Make demonstrable incremental progress

Areas of Focus
• Improving Reporting Infrastructure
• Improving Development Practices
• Improving Data Practices

Categories of Work
• Day to day operations
  – upgrading the environment
  – monitoring the loads
  – supporting minor changes
• Housekeeping
  – Metadata maintenance
  – Warehouse documentation
• New reporting tool administration
• Building reports
• Strategic questions
• Modeling
• New information in Warehouse

Resources
• Team
  – 2 Data Administrators
  – 2 Developers
  – 2 Reporting Consultants
• Project Teams
• Business Owners

Next Steps 2011
• Improving Reporting Infrastructure
  – migration to VM
  – develop plan for 24x7
  – reporting tool proof of concept
• Improving Development Practices
  – initial data modeling support
• Improving Data Practices
  – document problems & issues
  – improve metadata documentation

Issues to be Addressed
• Tested in reporting tool proof of concept
  – better Excel integration
  – easier report delivery
    • automatic notification and delivery
  – better report descriptions
  – better report organization
  – simpler views of information
  – simpler joining of information
• Short term solution for Singapore access

Work Deferred (Post FY2011)
• Improving Reporting Infrastructure
  – real-time data
  – 24x7 implementation
  – job scheduling
• Improving Development Practices
  – to be determined
• Improving Data Practices
  – strategic (Hard to answer questions)
  – define and publish guidelines
  – define and publish roles & responsibilities
  – document data access policies
• New data in Warehouse
  – should be done with new tool

Improving Understanding of Data
• Simpler, more focused views of information
• Better definitions and documentation
• Reporting tool proof of concept
  – including process for developing specialized views

Improving IS&T Practices
• Focus with the business on up-front design
  – conceptual data model
  – business process models
  – reporting requirements
• Metadata documentation
• Document reconciliation processes

Improving Data Practices
• Publish principles & guidelines for data management
• Establish processes for data issue resolution and prioritization
• Gain consensus and document roles & responsibilities concerning data
• Develop a method to monitor progress

Timeline
July 2011
• Improving Warehouse Environment
  – Migrate to Linux/VM
  – Partitioning
  – Job Scheduling
  – 7x24
• Improving the Understanding of Data
  – Reporting Tool Pilot → Rollout New Tool per Area
  – Improving Meta Data → Data Documentation per Area
  – Space Example → Define Focused Views per Area
• Improving IS&T Practices
  – Establishing Modeling Conventions
• Improving Data Practices
  – Defining Roles & Responsibilities

Appendix
• Warehouse background
  – design principles
  – Warehouse processing
  – statistics
• Example of addressing issues in a single domain
  – issues
  – short term plan
  – long term plan
• Issues found so far grouped by subject

Data Warehouse Guiding Principles
• Information in a single place
  – access any information in the same manner
  – integrated information
    • join information from several sources
• Structures should support the needs of all types of users
  – Central, DLC, Institutional
• Open access via SQL
  – not proprietary-only access
  – uncommon in education and industry
  – can be accessed programmatically
• Warehouse is the data distribution hub
  – easy to verify when feeds and reports are from a single source

Data Warehouse Guiding Principles (cont.)
• Data should be stable
  – consistency between reports generated on the same day
  – ability to generate the same report at any point in time
• Data has to be accurate
  – stale data is better than inaccurate data
• Structures should make it easy to report
• Data is not corrected or modified in the Warehouse
• Access rules are shared by the Warehouse and transactional systems
  – access managed by Business Owners

Data Warehouse Guiding Principles (cont.)
• Data always comes from a system of record
• Users should be able to generate and share reports
• Metadata is available for all information
• Access control done at the database level
  – enables direct access via any means
• Metadata-driven transformations and loads
  – to know how data was arrived at
  – same code called thousands of times every night
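
A minimal sketch of what a metadata-driven transformation could look like, in Python. It is an illustration only, not the Warehouse's actual code: the registry `FUNCTIONS`, the `FIELD_METADATA` rows, and every field name are hypothetical. It shows how one generic routine can be reused for every field once the mapping lives in metadata.

```python
from datetime import datetime

# Hypothetical registry of reusable conversion functions, looked up by name.
FUNCTIONS = {
    "trim": lambda value: value.strip(),
    "upper": lambda value: value.strip().upper(),
    "to_int": lambda value: int(value),
    "to_date": lambda value: datetime.strptime(value, "%Y%m%d").date(),
}

# Hypothetical metadata rows: one entry per target field.
FIELD_METADATA = [
    {"source": "ORG_UNIT", "target": "org_unit_id", "function": "to_int"},
    {"source": "BLDG", "target": "building_key", "function": "upper"},
    {"source": "BUILT", "target": "date_built", "function": "to_date"},
]


def transform(record, metadata=FIELD_METADATA):
    """Apply the function named in each metadata row to one source record."""
    return {
        row["target"]: FUNCTIONS[row["function"]](record[row["source"]])
        for row in metadata
    }


print(transform({"ORG_UNIT": "152000", "BLDG": " e19 ", "BUILT": "19160612"}))
```

Because the mapping is data, the metadata rows also document how each reported value was derived.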

Data Warehouse Process Stages
Metadata Driven
• Extract
  – minimal logic
  – full or incremental files generated
• Transfer
  – usually pushed automatically
  – encrypted/decrypted
  – simple integrity checking done
• Convert
  – data into reporting format and structures
    • star schemas
• Load

Metadata  

Integrity Checking in all Processing
• Correct files on hand before job runs
• Record & byte counts
• Comparisons of control file to data file
• Conversion type checking, number and types of fields
• DBMS constraints
• Error checking and validation routines
• Mail sent to Warehouse & Business Owners
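
The record and byte count comparison between a control file and a data file can be sketched as follows. This is an assumption-laden illustration: the control-file layout (a single line holding the expected record count and byte count) and the function name `check_extract` are hypothetical, not the Warehouse's real format.

```python
import os
import tempfile


def check_extract(data_path, control_path):
    """Compare a data file's record and byte counts to its control file.

    Assumed control-file layout (hypothetical): one line, "<records> <bytes>".
    Returns a list of error messages; an empty list means the file passed.
    """
    with open(control_path, encoding="ascii") as control:
        expected_records, expected_bytes = map(int, control.read().split())

    actual_bytes = os.path.getsize(data_path)
    with open(data_path, "rb") as data:
        actual_records = sum(1 for _ in data)

    errors = []
    if actual_records != expected_records:
        errors.append(f"record count {actual_records} != {expected_records}")
    if actual_bytes != expected_bytes:
        errors.append(f"byte count {actual_bytes} != {expected_bytes}")
    return errors


# Demonstration with a two-record, 22-byte extract written to a temp directory.
with tempfile.TemporaryDirectory() as workdir:
    data_path = os.path.join(workdir, "space.dat")
    control_path = os.path.join(workdir, "space.ctl")
    with open(data_path, "wb") as f:
        f.write(b"E19|152000\nW20|401000\n")
    with open(control_path, "w", encoding="ascii") as f:
        f.write("2 22")
    print(check_extract(data_path, control_path))  # prints []
```

A non-empty result is the point at which a job would stop and mail would go to the Warehouse staff and Business Owners.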

System Availability
• Approaches 24 x 7
  – except Saturdays from 3:00 pm – 3:10 pm
    • snapshot
  – some tables are temporarily unavailable in the early morning as new day’s data is loaded

Statistics
Jobs per night: ~300
Input files per night: 875
Field transformations: 12,847
Distinct functions: 78

Tables: 1,522
Views: 1,150
Records: 1,904,948,496
Total database size: 575 GB

Statistics
Fiscal Year 2011 Quarter 2 (October 1, 2010 – December 31, 2010)
• 664 distinct users logged into the Data Warehouse
• Representing 117 org units
• Generating 782,177 sessions

Statistics
DW User Community: Sessions by Department (400+ Sessions), FY 2011 Q2
[Pie chart. Legend entries: Environment, Health & Safety Office; HQ, Information Services & Technology; Others (research affiliates and scientists, postdoctoral fellows, visiting faculty/students/scholars, non-current employees, etc.); Center for International Studies; Human Resources; Research Laboratory of Electronics; Office of Sponsored Programs; Sloan School of Management; Information Services & Technology; Microsystems Technology Laboratories; GL Ops & Reporting & Journal Vouchers; Budget, Finance and Treasury; Sponsored & Genl A/R & Cashier. Slice values: 492, 451, 547, 730, 483, 441, 615, 808, 753, 840, 965, 902, 1151, 1299, 20428, 1496, 1752, 2092, 2201, 3167, 3248, 10490, 5413]


Data Issues Example
(Space)
• Buildings need to be mapped to their cost collectors manually
• Building costs are collected via secondary postings
  – many of these are done manually, requiring significant effort
• Utility usage metadata not available in the Warehouse
• Accuracy is flawed because some utility meters cover more than one building
• The older organization unit identifiers (6 digit) make it hard to relate space information to other domains
• Not all systems use a standard format when referencing location information, making integration difficult
• Building built and last renovated dates are not shown in the Warehouse
• Access to space information is not consistent or documented
• Building depreciation information not in the Warehouse

Space Data Short Term Plan
• Use new HR org unit numbers in the feeds and views of the Warehouse
• Add the cost collector that relates to the building in the building feed
  – include the cost collector key in the Warehouse view
• Add new fields to the building feeds and views for date built and date renovated
• Add geocode fields to the room feed and views
• Update the Warehouse metadata
  – add information about utility usage

Space Long Term Plan
• Work with Facilities and others to agree on guidelines for integrating location information
  – building and room identifiers
  – geocoding
    • guidelines for tracking geocode for things like equipment
• Work to gain consensus and document access policies concerning space information
• Ensure that space survey data is in the Warehouse
• Encourage reporting of space information from the Warehouse where appropriate
  – possible pilot web reporting
• Bring building depreciation information into the Warehouse
  – create an integrated view of all space costs in the Warehouse

HR Data Issues
• Faculty promotion and tenure dates have to be cleaned up by hand
  – SAP & business policy do not support correction of some information
• CIP (Academic Discipline) attached to all Academic Staff
• Race & Ethnicity not consistent over time
  – self-identified ethnicity needs to be managed and synchronized with Institutionally identified ethnicity
• Historical data is inaccurate because old records are changed without the modified date and effective dates being properly maintained
• Many places where more than one affiliation occurs, but one always needs to be marked "Primary" to support non-duplicated counts
• HR data not accurate.
  – How many researchers have a PhD?
• Many things could be calculated in a standard way to make it easier and reduce discrepancies.
  – For example: years at MIT could be provided in addition to the appointment dates.
• There seem to be different tallies for the Annual Head Count Report (October 31st), which cause confusion.
• Inaccuracies in Faculty appointment dates
• Inaccuracies in Faculty Leave information
• Ethnicity & Race not maintained accurately
• How many FTE by FY for a DLC for the last 10 years
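
Two of the points above, the "Primary" affiliation for non-duplicated counts and a standard "years at MIT" calculation, can be illustrated with a short Python sketch. The row layout, the sample IDs, and the function names are hypothetical, and the sketch assumes exactly one primary appointment per person.

```python
from datetime import date

# Hypothetical appointment rows: (mit_id, org_unit, is_primary, start_date).
APPOINTMENTS = [
    ("900000001", "Physics", True, date(2001, 9, 1)),
    ("900000001", "Research Laboratory of Electronics", False, date(2005, 7, 1)),
    ("900000002", "Sloan School of Management", True, date(2010, 1, 15)),
]


def headcount_by_unit(appointments):
    """Count each person once, under the unit of the primary appointment."""
    counts = {}
    for _mit_id, org_unit, is_primary, _start in appointments:
        if is_primary:
            counts[org_unit] = counts.get(org_unit, 0) + 1
    return counts


def years_at_mit(appointments, mit_id, as_of):
    """Whole years between a person's earliest appointment start and as_of."""
    first = min(start for pid, _, _, start in appointments if pid == mit_id)
    years = as_of.year - first.year
    # Subtract one if the anniversary has not yet been reached this year.
    if (as_of.month, as_of.day) < (first.month, first.day):
        years -= 1
    return years


print(headcount_by_unit(APPOINTMENTS))
print(years_at_mit(APPOINTMENTS, "900000001", date(2011, 1, 26)))  # prints 9
```

Publishing one such calculation alongside the raw appointment dates would give every report the same answer.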

Finance Data Issues
• Hard to maintain CEMIT Groups
  – can't tell if they're used
    • so hesitant to delete them
  – Need to verify that they are all correct each month
    • maybe an exclusion list would help (to see if the right gl_accounts are excluded and only those)
    • mistakes lead to reporting errors
• Difficult to separate out internal charging and transfers
  – some specific GL_Accounts for internal that can be excluded for this view
  – could use SI transactions instead of SA, but no way to ensure people do that correctly
  – no way to enforce integrity rule in SAP
• Many Document Types not used, but not filtered out of list
• Sponsor Billing done at Parent WBS level, so children look in deficit
  – no common definition of what "deficit" accounts are
  – billing lag means DLCs and Central might have different views and definitions
• GL_Account_Report is confusing.
• Travel charges that haven't been submitted on a trip report can't be reported on
• Guest Card charges not available for DLCs
• Hard to see if changes in a Person's status from ON/OFF Campus have been made before the postings
• Need Profit Center Group History so that we can recreate old reports and understand the differences in numbers
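
The exclusion-list idea raised for CEMIT Groups could be checked with simple set arithmetic. The Python sketch below is only an illustration: the function name `check_exclusions` is hypothetical and the account numbers are made up.

```python
def check_exclusions(all_accounts, group_accounts, expected_exclusions):
    """Compare the accounts a group leaves out with an expected exclusion list.

    Returns (wrongly_excluded, wrongly_included) as sorted lists.
    """
    group = set(group_accounts)
    expected = set(expected_exclusions)
    actually_excluded = set(all_accounts) - group
    # Accounts missing from the group that nobody asked to exclude.
    wrongly_excluded = sorted(actually_excluded - expected)
    # Accounts that should have been excluded but are still in the group.
    wrongly_included = sorted(expected & group)
    return wrongly_excluded, wrongly_included


# Made-up account numbers for illustration.
all_accounts = ["420050", "420226", "800100", "800200"]
group_accounts = ["420050", "800100"]
expected_exclusions = ["800100", "800200"]

print(check_exclusions(all_accounts, group_accounts, expected_exclusions))
# prints (['420226'], ['800100'])
```

Two empty lists mean the right accounts are excluded and only those, which is the monthly verification the slide asks for.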

Student System Issues
• What is the Cost of Education?
  – By Department and Program
    • per Student
  – How much Tuition is Charged?
    • What is waived?
  – How much of Tuition is paid?
    • by the Student
    • by a Sponsor
• Don't have accurate info on Grads.
  – Where did they get their undergrad degree?
  – When did a student enter in which degree program?
  – They have Grad entry term, but not cohort.
  – Are they fully supported (external sources hard)?